[00:00:01] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) I'm a bit confused now. I thought that was the question we talked about in today's meeting. [00:00:55] legoktm: I prefer it outside too but that's what's in the handbook https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers [00:01:23] does something rely on that exact format? [00:01:26] I doubt it... [00:04:50] yeah, I think we should change it in both places thcipriani would that be okay? [00:05:01] both in the handbook and the scripts [00:06:40] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:28] PROBLEM - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4005: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:08:42] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw2410.codfw.wmnet [00:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:50] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw2411.codfw.wmnet [00:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:54] RECOVERY - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:10:55] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw2411.codfw.wmnet 
[00:10:58] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw2411.codfw.wmnet [00:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:00] (03PS1) 10Legoktm: conftool-data: Document which servers are only pooled as jobrunner/videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/679022 (https://phabricator.wikimedia.org/T279100) [00:16:01] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Legoktm) 05Open→03Resolved uh, that's right, my bad >.< I submitted a documentation patch just... [00:23:48] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "confirmed with conftctl" [puppet] - 10https://gerrit.wikimedia.org/r/679022 (https://phabricator.wikimedia.org/T279100) (owner: 10Legoktm) [00:24:10] (03CR) 10Legoktm: [C: 03+2] conftool-data: Document which servers are only pooled as jobrunner/videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/679022 (https://phabricator.wikimedia.org/T279100) (owner: 10Legoktm) [00:24:45] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) documentation patch +1, confirmed that's how it is now. Thanks. I could also be wrong, if w... 
[00:27:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:00] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 56658080 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:06:24] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 37000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:18:16] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:30:34] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:43:32] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cron.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:45:44] (03PS1) 10Andrew Bogott: wmcs-policy-tests.py: add Designate tests [puppet] - 10https://gerrit.wikimedia.org/r/679083 (https://phabricator.wikimedia.org/T279845) [02:49:56] !log andrew@deploy1002 Started deploy [horizon/deploy@ef844a1]: fix for T276963 [02:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:07] T276963: Horizon: add doc links and discouragement to the 'server groups' UIs - https://phabricator.wikimedia.org/T276963 [02:50:24] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-policy-tests.py: add Designate tests [puppet] - 10https://gerrit.wikimedia.org/r/679083 (https://phabricator.wikimedia.org/T279845) (owner: 10Andrew Bogott) [02:54:07] !log andrew@deploy1002 Finished deploy [horizon/deploy@ef844a1]: fix for T276963 (duration: 04m 10s) [02:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:36] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:08] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 3.622 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:46] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:50:20] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [04:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:58] (03PS1) 10Marostegui: mariadb: Decommission db1076 [puppet] - 10https://gerrit.wikimedia.org/r/679138 (https://phabricator.wikimedia.org/T274752) [05:04:45] !log root@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1076.eqiad.wmnet [05:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:45] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [05:07:47] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [05:07:50] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [05:07:51] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [05:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:24] PROBLEM - Check systemd state on mw2265 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[05:14:46] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1076.eqiad.wmnet [05:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:04] 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10Marostegui) a:05Marostegui→03wiki_willy [05:16:08] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [05:19:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1076 [puppet] - 10https://gerrit.wikimedia.org/r/679138 (https://phabricator.wikimedia.org/T274752) (owner: 10Marostegui) [05:25:42] (03PS1) 10Marostegui: db1177: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/679149 (https://phabricator.wikimedia.org/T275633) [05:28:54] (03CR) 10Marostegui: [C: 03+2] db1177: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/679149 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [05:30:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1177 with minimal weight on s8 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15313 and previous config saved to /var/cache/conftool/dbconfig/20210414-052959-marostegui.json [05:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:09] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:25:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1177 with minimal weight on s8 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15314 and previous config saved to /var/cache/conftool/dbconfig/20210414-062549-marostegui.json [06:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:01] T275633: Productionize db21[45-52] and 
db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:49:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:52:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:58:31] (03PS1) 10Ayounsi: Add DHCP relay support for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) [06:59:33] (03CR) 10jerkins-bot: [V: 04-1] Add DHCP relay support for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [07:02:55] (03PS2) 10Ayounsi: Add DHCP relay support for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) [07:03:36] (03CR) 10jerkins-bot: [V: 04-1] Add DHCP relay support for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [07:04:43] (03PS3) 10Ayounsi: Add DHCP relay support for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) [07:06:14] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:54] (03CR) 10Ayounsi: [C: 03+2] Add DHCP relay support for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [07:07:37] (03Merged) 10jenkins-bot: Add DHCP relay 
support for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [07:15:30] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10ayounsi) [07:22:40] !log push pfw policy - T280059 [07:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/677514 (owner: 10Jbond) [07:31:42] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:14] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Bump memory/cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/679240 (https://phabricator.wikimedia.org/T279411) [07:36:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Bump memory/cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/679240 (https://phabricator.wikimedia.org/T279411) (owner: 10Alexandros Kosiaris) [07:37:07] 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10MoritzMuehlenhoff) >>! In T128592#6996726, @Legoktm wrote: > Is it even possible for IRC to be active-active? Doesn't the client have to maintain a connection with a single server, and if that s... [07:38:31] (03Merged) 10jenkins-bot: linkrecommendation: Bump memory/cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/679240 (https://phabricator.wikimedia.org/T279411) (owner: 10Alexandros Kosiaris) [07:40:50] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[07:40:50] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [07:40:50] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [07:40:51] !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 [07:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:17] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [07:41:33] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [07:41:33] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [07:41:33] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [07:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:01] !log imported chartmuseum_0.13.1-1 to buster-wikimedia [07:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:32] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [07:42:33] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[07:42:33] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [07:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:43] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw [07:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:13] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10fgiunchedi) 05Open→03Resolved Thank you @papaul, all good [07:51:10] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw [07:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:29] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad [07:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:09] !log restarting blazegraph + updater on wdqs1013 [07:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:48] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [07:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:03] !log depooling wdqs1013 - catching up on lag [07:57:05] ryankemper: ^ [07:57:10] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.087 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:20] PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 1.305e+05 ge 4.32e+04 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [07:58:48] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [07:59:27] !log depooling wdqs2001 - catching up on lag [07:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:20] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:06] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [08:01:36] RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:45] (03PS1) 10Muehlenhoff: Remove kraz [puppet] - 10https://gerrit.wikimedia.org/r/679250 [08:01:47] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [08:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:00] PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 1.304e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:04:38] PROBLEM - WDQS high update lag on wdqs2004 is CRITICAL: 1.307e+05 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:05:51] (03PS2) 10Effie Mouzeli: hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) [08:05:56] !log depooling wdqs2004 - catching up on lag [08:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:22] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [08:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:50] ACKNOWLEDGEMENT - WDQS high update lag on wdqs2001 is CRITICAL: 1.302e+05 ge 4.32e+04 Gehel catching up on lag after data reload https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:06:50] ACKNOWLEDGEMENT - WDQS high update lag on wdqs2004 is CRITICAL: 
1.307e+05 ge 3600 Gehel catching up on lag after data reload https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:07:08] !log updated chartmuseum to 0.13.1 on chartmuseum1001, chartmuseum2001 [08:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:34] (03PS1) 10Alexandros Kosiaris: conftool: Create a shared jobrunner_videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/679258 (https://phabricator.wikimedia.org/T279100) [08:16:15] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=(wtp1033.eqiad.wmnet|wtp1032.eqiad.wmnet) [08:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:25] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: cluster=videoscaler,service=apache2,name=mw2395.codfw.wmnet [08:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:31] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: cluster=videoscaler,service=apache2,name=mw2394.codfw.wmnet [08:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:58] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) I've gone a bit overboard and created https://gerrit.wikimedia.org/r/679258 that uses YA... 
[08:28:45] (03PS1) 10Jbond: P:debmonitor::client: update debmon-client systemd::timer [puppet] - 10https://gerrit.wikimedia.org/r/679263 [08:33:39] (03PS1) 10Jbond: P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 [08:34:50] (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 (owner: 10Jbond) [08:35:38] (03PS2) 10Jbond: P:debmonitor::server: drop systemd-cat for gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 [08:36:30] PROBLEM - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4005: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:36:35] (03PS3) 10Jbond: P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 [08:36:48] (03CR) 10Filippo Giunchedi: "nits inline but LGTM otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [08:36:53] (03CR) 10Filippo Giunchedi: [C: 03+1] check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [08:38:52] RECOVERY - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:39:10] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) >>! In T279100#6997273, @akosiaris wrote: > I 've gone a bit overboard and created https://g... 
[08:40:21] jouncebot: now [08:40:21] No deployments scheduled for the next 2 hour(s) and 19 minute(s) [08:40:24] jouncebot: next [08:40:24] In 2 hour(s) and 19 minute(s): [[Backport windows|European mid-day backport window]] (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1100)
[08:40:31] !log Staging on mwdebug1002 [08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:35] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679263 (owner: 10Jbond) [08:44:36] !log Run scap pull on mwdebug1002 [08:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:06] 10SRE, 10Patch-For-Review: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10MoritzMuehlenhoff) 05Open→03Resolved bast1003 has now fully replaced bast1002. The decom task for bast1002 is T280110 [08:48:09] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10MoritzMuehlenhoff) [08:48:12] 10SRE, 10Patch-For-Review: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10MoritzMuehlenhoff) [08:50:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:52:35] (03PS1) 10Muehlenhoff: Remove bast1002 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/679273 [08:53:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:53:27] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission for hosts bast1002.wikimedia.org [08:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:15] (03CR) 10Muehlenhoff: [C: 03+2] Remove bast1002 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/679273 (owner: 10Muehlenhoff) [08:58:26] (03PS2) 10Jbond: P:debmonitor::client: update debmon-client systemd::timer [puppet] - 10https://gerrit.wikimedia.org/r/679263 [08:58:56] (03PS1) 10Jbond: debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 [09:00:02] RECOVERY - mediawiki-installation DSH group on wtp1033 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:02:20] (03CR) 10Volans: "I just noticed that I forgot to add a timeout to those requests here, so we should probably duplicate what's in wmflib to have both behavi" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [09:03:15] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [09:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:54] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast1002.wikimedia.org [09:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:28] PROBLEM - Query Service HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:04:58] (03CR) 10Volans: "> Patch Set 1:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [09:05:00] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:16] PROBLEM - WDQS SPARQL on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:05:30] (03CR) 10jerkins-bot: [V: 04-1] debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [09:06:28] !log T267927 depool `wdqs2001` following data transfer (catching up on lag) [09:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:40] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [09:06:41] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679268 (owner: 10Jbond) [09:06:45] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, once fixed the duplicated line, +1." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [09:07:29] (03PS1) 10Urbanecm: Don't allow query and cookie hacks to enable topic subscriptions [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678994 (https://phabricator.wikimedia.org/T280082) [09:07:46] (03PS3) 10Effie Mouzeli: hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) [09:07:52] (03CR) 10Urbanecm: [C: 03+2] "train blocker" [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678994 (https://phabricator.wikimedia.org/T280082) (owner: 10Urbanecm) [09:07:54] (03CR) 10Effie Mouzeli: hieradata: enable onhost memcached socket on all mw clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [09:08:14] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 (owner: 10Jbond) [09:08:24] (03CR) 10Jbond: [C: 03+2] P:debmonitor::client: update debmon-client systemd::timer [puppet] - 10https://gerrit.wikimedia.org/r/679263 (owner: 10Jbond) [09:09:14] (03PS4) 10Jbond: P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 [09:09:24] (03PS5) 10Jbond: P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 [09:09:34] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission for hosts kraz.wikimedia.org [09:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) 
[09:10:30] (03CR) 10Volans: "question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [09:10:46] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [09:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:54] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:04] !log T267927 depooled `wdqs1004` following data transfer (catching up on lag), current round of data transfers is done so there shouldn't be any left to depool [09:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:14] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [09:14:18] ryankemper: do you know why wdqs1003 is complaining? [09:14:20] RECOVERY - mediawiki-installation DSH group on wtp1032 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:14:22] (03CR) 10Jbond: check_https_client_auth_puppet: add new icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [09:14:30] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=udpmxircecho site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:51] (03Merged) 10jenkins-bot: Don't allow query and cookie hacks to enable topic subscriptions [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678994 (https://phabricator.wikimedia.org/T280082) (owner: 10Urbanecm) [09:16:10] gehel: not sure about either 1003 or 1010, neither should be related 
to the transfers [09:16:25] !log restarting blazegraph on wdqs1003 [09:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:33] ^ the udpmxircecho should be harmless, will have a look soon [09:19:50] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kraz.wikimedia.org [09:19:57] 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `kraz.wikimedia.org` - kraz.wikimedia.org (**PASS**) - Dow... [09:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:23] (03PS1) 10Muehlenhoff: Update NAT exceptions for kraz -> irc1001/irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/679278 [09:20:28] (03PS1) 10Filippo Giunchedi: admin: add lmeintrup [puppet] - 10https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531) [09:22:07] !log depooling wdqs1003 - corrupted data after data reload [09:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:35] !log repooling wdqs1013, catched up on lag [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:15] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/DiscussionTools/includes/Hooks/HookUtils.php: e4b2d93dcf86a336314ed09fd37844edb16f4f30: Dont allow query and cookie hacks to enable topic subscriptions (T280082) (duration: 01m 24s) [09:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:24] T280082: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'mediawikiwiki.discussiontools_subscription' doesn't exist (10.64.16.7) - https://phabricator.wikimedia.org/T280082 [09:24:24] (03CR) 10Jbond: [C: 03+2] P:debmonitor::Server: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/677514 
(owner: 10Jbond) [09:24:32] (03PS5) 10Jbond: P:debmonitor::Server: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/677514 [09:25:15] (03CR) 10Volans: "> Patch Set 1:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [09:27:19] (03PS1) 10Filippo Giunchedi: admin: add hnordeen [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) [09:27:40] !log disable puppet on all mediawiki servers to merge 676580 [09:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:02] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [09:29:02] !log depooling wdqs1004 - corrupted data after data reload [09:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:17] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) >>! In T279100#6997312, @jijiki wrote: >>>! In T279100#6997273, @akosiaris wrote: >> I 'v... 
[09:32:25] (03CR) 10Volans: check_https_client_auth_puppet: add new icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [09:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1177 with minimal weight on s8 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15316 and previous config saved to /var/cache/conftool/dbconfig/20210414-093305-marostegui.json [09:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:15] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [09:36:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 20%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15317 and previous config saved to /var/cache/conftool/dbconfig/20210414-093642-root.json [09:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:45] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [09:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:37] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:38:38] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:38:51] PROBLEM - Memcached on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:38:59] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:18] PROBLEM - Memcached on parse2005 is CRITICAL: connect to address 10.192.0.186 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:39:18] PROBLEM - Memcached on parse2009 is CRITICAL: connect to address 10.192.16.25 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:39:29] PROBLEM - Memcached on parse2007 is CRITICAL: connect to address 10.192.16.22 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:40:02] that is me ^ [09:40:14] sorry [09:40:19] PROBLEM - Memcached on parse2008 is CRITICAL: connect to address 10.192.16.24 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:40:49] PROBLEM - Memcached on parse2010 is CRITICAL: connect to address 10.192.16.206 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:40:57] PROBLEM - Memcached on parse2013 is CRITICAL: connect to address 10.192.32.197 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:41:01] RECOVERY - Check systemd state on mw2265 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:11] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:18] PROBLEM - Memcached on parse2020 is CRITICAL: connect to address 10.192.48.153 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:42:22] PROBLEM - Memcached on parse2014 is CRITICAL: connect to address 10.192.32.198 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:42:25] PROBLEM - Memcached on parse2017 is CRITICAL: connect to address 10.192.48.150 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:42:35] PROBLEM - Memcached on parse2016 is CRITICAL: connect to address 10.192.48.149 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:42:39] PROBLEM - Memcached on parse2018 is CRITICAL: connect to address 10.192.48.151 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:42:43] PROBLEM - Memcached on parse2015 is CRITICAL: connect to address 10.192.32.199 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:42:47] PROBLEM - Memcached on parse2004 is CRITICAL: connect to address 10.192.0.185 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:42:47] PROBLEM - Memcached on parse2012 is CRITICAL: connect to address 10.192.32.196 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:43:22] (03PS5) 10David Caro: ceph: add ceph repo and parameter to all client modules [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) [09:43:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:44:11] PROBLEM - Memcached on parse2019 is CRITICAL: connect to address 10.192.48.152 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:46:47] PROBLEM - Memcached on parse2003 is CRITICAL: connect to address 10.192.0.184 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:46:51] ACKNOWLEDGEMENT - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:51] ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.002 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:46:52] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.046 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 [09:46:52] .wikimedia.org/wiki/Wikidata_query_service/Runbook [09:46:53] ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:54] ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:46:55] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.052 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 [09:46:56] .wikimedia.org/wiki/Wikidata_query_service/Runbook [09:51:45] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Add an internal release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 [09:51:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 30%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15318 and previous config saved to /var/cache/conftool/dbconfig/20210414-095146-root.json [09:51:47] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Cleanup production release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 [09:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:55] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [09:54:08] (03PS2) 10Muehlenhoff: Remove kraz [puppet] - 10https://gerrit.wikimedia.org/r/679250 [09:54:33] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [09:57:08] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10fgiunchedi) p:05Triage→03Medium [10:02:13] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:32] (03PS1) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 [10:04:34] (03PS1) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 
10https://gerrit.wikimedia.org/r/679293 [10:04:54] (03CR) 10Marostegui: mariadb: Promote db1159 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/678801 (https://phabricator.wikimedia.org/T276448) (owner: 10Marostegui) [10:04:56] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1159 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/678801 (https://phabricator.wikimedia.org/T276448) (owner: 10Marostegui) [10:05:19] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:19] (03CR) 10Kosta Harlan: [C: 03+1] linkrecommendation: Cleanup production release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris) [10:05:29] (03CR) 10David Caro: [C: 03+2] ceph.codfw1: enable ceph octopus repo [puppet] - 10https://gerrit.wikimedia.org/r/677583 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:05:47] (03CR) 10Kosta Harlan: [C: 03+1] linkrecommendation: Add an internal release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris) [10:06:27] (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [10:06:30] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [10:06:49] (03CR) 10Muehlenhoff: [C: 03+2] Remove kraz [puppet] - 10https://gerrit.wikimedia.org/r/679250 (owner: 10Muehlenhoff) [10:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 40%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15319 and previous config saved to /var/cache/conftool/dbconfig/20210414-100649-root.json [10:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:07:00] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:08:14] (03PS2) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 [10:08:24] (03PS2) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 [10:09:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29024/console" [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [10:10:18] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris) [10:11:24] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:11:43] (03CR) 10Volans: "I'm not familiar with the send_mail puppetization but +1 for the approach." [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [10:11:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29025/console" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [10:12:05] 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff kraz has been replaced by two Buster instances (irc1001.wikimedia.org and irc2001.wik... 
[10:12:50] (03Merged) 10jenkins-bot: linkrecommendation: Add an internal release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris) [10:13:15] (03CR) 10Volans: "> Patch Set 2: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [10:14:18] ACKNOWLEDGEMENT - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:18] ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:20] moritzm: is there a reason not to point irc.wm.o to both 1001 and 2001 instead of having one standby? [10:15:35] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:18:00] (03CR) 10Jbond: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [10:21:43] In 10 minutes we are restarting m1 master (etherpad, librenms, backups, bacula...) 
T276448 [10:21:44] T276448: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 [10:21:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15320 and previous config saved to /var/cache/conftool/dbconfig/20210414-102153-root.json [10:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:05] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:22:22] (03PS1) 10Elukey: Add kafka-logging1001 to term kafka in analytics-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/679296 [10:23:26] XioNoX: around for a quick cr? :) [10:23:31] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679296 [10:25:25] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: Upgrading ceph to octopus [10:25:28] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Upgrading ceph to octopus [10:25:31] (03PS3) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 [10:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:34] (03PS3) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 [10:25:36] (03PS1) 10Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - 10https://gerrit.wikimedia.org/r/679297 [10:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:38] (03PS4) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 [10:26:40] (03PS2) 10Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - 
10https://gerrit.wikimedia.org/r/679297 [10:26:42] (03PS4) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 [10:26:48] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [10:27:14] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:28:14] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [10:28:41] (03PS3) 10Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - 10https://gerrit.wikimedia.org/r/679297 [10:28:44] (03PS5) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 [10:28:45] (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [10:28:48] akosiaris: around for the failover? [10:29:48] jynus kormat ready? 
[10:29:53] I am here [10:30:15] here [10:30:23] Good, I am going to go ahead [10:30:29] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [10:30:33] !log Failover m1 from db1080 to db1159 - T276448 [10:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:43] T276448: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 [10:31:09] done [10:31:11] checking services [10:31:14] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:23] etherpad works for [10:31:28] me [10:31:29] (03PS6) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 [10:31:48] moritzm jbond42 switchover done, can you check cas/pki? [10:31:49] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqia... [10:32:06] marostegui: perfect! [10:32:17] works for me too btw [10:32:23] \o/ [10:32:34] librenms seems to be working too [10:32:56] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [10:33:00] kormat: orchestrator needs cleaning up to remove the old heartbeat, I can do that later, not urgent [10:33:07] ack [10:33:21] is there something missing, other than dbbackups?
[10:33:39] jynus: I am checking racktables and rt [10:33:49] So only backups I think [10:34:23] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:23] there is a rebase conflict, let me do it manually [10:34:29] oki [10:34:51] (03PS7) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 [10:35:29] kormat: orchestrator cleaned up, all good now [10:35:42] kormat: we should probably include this step on the failover checklist, as this will always be needed [10:35:45] mmm strange, when I downloaded it, it didn't conflict [10:35:48] marostegui: CAS/IDP works fine, I just forced a new login with my U2F validation (which is fetched from mysql) [10:35:48] marostegui: yeah [10:35:58] (03PS5) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 [10:36:04] (03PS2) 10Jcrespo: dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) [10:36:13] (03PS3) 10Jcrespo: dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) [10:36:14] moritzm: thanks :* [10:36:45] marostegui, can I get a quick +1 up there to double check the new primary db name?
[10:36:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 60%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15321 and previous config saved to /var/cache/conftool/dbconfig/20210414-103659-root.json [10:37:06] sure [10:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:08] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:37:36] I will run a backup and check alerts after merging it [10:37:41] thanks [10:38:39] (03CR) 10Marostegui: [C: 03+1] dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) (owner: 10Jcrespo) [10:39:26] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:26] (03CR) 10Kosta Harlan: [C: 03+1] "What will deploying this do to the currently running linkrecommendation-production-load-datasets-1618390800-np4ch container?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris) [10:39:28] (03PS6) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 [10:39:31] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) (owner: 10Jcrespo) [10:39:32] running puppet on alert1001, which will take a bit [10:40:28] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [10:41:12] (03CR) 10Kosta Harlan: linkrecommendation: Add an internal release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris) [10:42:36] Majavah: see https://phabricator.wikimedia.org/T128592#6996726 [10:43:00] marostegui, db1080 will be decommissioned? [10:43:02] RECOVERY - Check systemd state on mw1386 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:06] (03PS1) 10Marostegui: db1080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/679307 [10:43:07] jynus: yes, but not today [10:43:11] sure [10:43:13] jynus: in a week or more if needed [10:43:31] I think we will only be sure the alerts are no longer pointing to db1080 then [10:43:37] in case there is some logic bug [10:44:08] I was thinking about waiting a whole week, would that work for you or you need more? 
[10:44:15] yeah, no rush [10:44:16] (03CR) 10Marostegui: [C: 03+2] db1080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/679307 (owner: 10Marostegui) [10:44:24] I was just commenting on the checks I can do for now [10:44:31] oki [10:44:35] I will run a new backup too [10:44:39] thanks [10:44:55] but it will take a few hours until it writes the results to the db, so that will have to wait too [10:45:07] no worries [10:45:17] so far, everything looks good [10:45:19] once you are happy with it, let me know so I can close the task (or close it yourself) [10:46:03] wait, let me run puppet on cumin hosts, as otherwise they will try to write to the read-only db [10:46:10] 10ops-eqiad, 10decommission-hardware: decommission bast1002.wikimedia.org - https://phabricator.wikimedia.org/T280110 (10MoritzMuehlenhoff) [10:46:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531) (owner: 10Filippo Giunchedi) [10:48:19] (03PS4) 10Muehlenhoff: Remove dbmonitor1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) [10:50:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm) [10:50:44] (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris) [10:51:23] (03CR) 10Jbond: [C: 03+1] "As the user is asking for access to turnilo they need approval from their manager and Analytics (Andrew Otto)," [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) (owner: 10Filippo Giunchedi) [10:51:31] (03CR) 10Jbond: [C: 04-1] admin: add hnordeen [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) (owner: 10Filippo Giunchedi) [10:51:34] 10SRE, 10DBA, 10Patch-For-Review: Productionize
db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:52:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 70%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15322 and previous config saved to /var/cache/conftool/dbconfig/20210414-105202-root.json [10:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:13] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:53:04] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10jbond) @Ottomata are you able to approve access to Turnilo for HNordeen [10:54:09] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:29] PROBLEM - Memcached on mw1315 is CRITICAL: connect to address 10.64.16.196 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:56:35] PROBLEM - Memcached on mw2401 is CRITICAL: connect to address 10.192.0.65 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:01] PROBLEM - Memcached on mw1345 is CRITICAL: connect to address 10.64.32.57 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:09] PROBLEM - Memcached on mw1343 is CRITICAL: connect to address 10.64.32.55 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:11] PROBLEM - Memcached on mw2402 is CRITICAL: connect to address 10.192.0.66 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:17] PROBLEM - Memcached on mw1340 is CRITICAL: connect to address 10.64.32.52 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:27] 
PROBLEM - Memcached on mw1339 is CRITICAL: connect to address 10.64.32.51 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:37] PROBLEM - Memcached on mw2405 is CRITICAL: connect to address 10.192.0.70 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:47] PROBLEM - Memcached on mw1290 is CRITICAL: connect to address 10.64.16.55 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:47] PROBLEM - Memcached on mw1346 is CRITICAL: connect to address 10.64.32.58 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:49] PROBLEM - Memcached on mw1344 is CRITICAL: connect to address 10.64.32.56 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:57:49] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging, I 'll deploy this once the job running right now is done." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris) [10:59:31] (03CR) 10David Caro: [C: 03+2] ceph: add ceph repo and parameter to all client modules [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:59:37] (03CR) 10Marostegui: [C: 03+1] "I will clean up the grants file" [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff) [10:59:43] PROBLEM - Memcached on mw1341 is CRITICAL: connect to address 10.64.32.53 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:59:43] (03CR) 10David Caro: [C: 03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:59:50] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1034.eqiad.wmnet with reason: REIMAGE [10:59:53] PROBLEM - Memcached on mw1363 is CRITICAL: connect to address 10.64.48.205 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) [[Backport windows|European mid-day backport window]]
deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:23] PROBLEM - Memcached on mw1356 is CRITICAL: connect to address 10.64.48.198 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:00:27] PROBLEM - Memcached on mw2396 is CRITICAL: connect to address 10.192.0.60 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:00:27] PROBLEM - Memcached on mw2404 is CRITICAL: connect to address 10.192.0.68 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:00:35] PROBLEM - Memcached on mw1362 is CRITICAL: connect to address 10.64.48.204 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:00:35] PROBLEM - Memcached on mw1361 is CRITICAL: connect to address 10.64.48.203 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:00:35] PROBLEM - Memcached on mw1377 is CRITICAL: connect to address 10.64.48.219 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:00:42] (03CR) 10Alexandros Kosiaris: linkrecommendation: Add an internal release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris) [11:00:56] (03Merged) 10jenkins-bot: linkrecommendation: Cleanup production release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris) [11:00:56] is that a monitoring issue? 
[11:01:45] PROBLEM - Memcached on mw1376 is CRITICAL: connect to address 10.64.48.218 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:01:47] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1035.eqiad.wmnet with reason: REIMAGE [11:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1034.eqiad.wmnet with reason: REIMAGE [11:02:01] jynus: likely related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/676580 [11:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:19] ok, thanks [11:02:29] but not sure if expected or not [11:02:47] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [11:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:59] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [11:02:59] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [11:02:59] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [11:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:45] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1036.eqiad.wmnet with reason: REIMAGE [11:03:48] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[11:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:00] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [11:04:00] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [11:04:00] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [11:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1035.eqiad.wmnet with reason: REIMAGE [11:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:48] (03PS3) 10Arturo Borrero Gonzalez: gridengine: set grid-configurator source files to use new domain name [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [11:04:50] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) [11:05:53] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [11:06:02] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[11:06:02] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [11:06:03] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [11:06:03] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [11:06:07] (03PS4) 10Jbond: check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 [11:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:12] (03CR) 10Jbond: "updated thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [11:06:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1036.eqiad.wmnet with reason: REIMAGE [11:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:58] (03CR) 10Volans: [C: 03+1] "I didn't test it, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond) [11:07:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 80%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15323 and previous config saved to /var/cache/conftool/dbconfig/20210414-110706-root.json [11:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:19] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [11:07:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29026/console" [puppet] - 
10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [11:10:24] PROBLEM - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is CRITICAL: SSL CRITICAL - Certificate restbase1021-c valid until 2021-05-14 11:10:13 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:10:27] PROBLEM - cassandra-b SSL 10.64.16.123:7001 on restbase1024 is CRITICAL: SSL CRITICAL - Certificate restbase1024-b valid until 2021-05-14 11:10:21 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:10:27] PROBLEM - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is CRITICAL: SSL CRITICAL - Certificate restbase1021-b valid until 2021-05-14 11:10:12 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:10:37] PROBLEM - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is CRITICAL: SSL CRITICAL - Certificate restbase1021-a valid until 2021-05-14 11:10:11 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:10:41] PROBLEM - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - Certificate restbase1027-c valid until 2021-05-14 11:10:31 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:11:01] PROBLEM - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is CRITICAL: SSL CRITICAL - Certificate restbase1026-b valid until 2021-05-14 11:10:27 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:11:01] PROBLEM - cassandra-a SSL 10.64.16.118:7001 on restbase1023 is CRITICAL: SSL CRITICAL - Certificate restbase1023-a valid until 2021-05-14 11:10:17 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:11:04] 10SRE, 10Performance-Team, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.1; 2021-04-13), and 2 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10jijiki) [11:11:05] PROBLEM - cassandra-b SSL 10.64.0.106:7001 on restbase1020 is CRITICAL: SSL CRITICAL - Certificate restbase1020-b 
valid until 2021-05-14 11:10:09 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:11:05] PROBLEM - cassandra-b SSL 10.64.16.119:7001 on restbase1023 is CRITICAL: SSL CRITICAL - Certificate restbase1023-b valid until 2021-05-14 11:10:18 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:11:08] (03CR) 10Muehlenhoff: "That's actually intentional, though. See https://github.com/wikimedia/puppet/commit/c22aeac15940e20af4a6bfdb64ae9e7e1775cc49" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond) [11:11:21] PROBLEM - cassandra-a SSL 10.64.16.114:7001 on restbase1022 is CRITICAL: SSL CRITICAL - Certificate restbase1022-a valid until 2021-05-14 11:10:14 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:11:43] PROBLEM - cassandra-c SSL 10.64.48.128:7001 on restbase1025 is CRITICAL: SSL CRITICAL - Certificate restbase1025-c valid until 2021-05-14 11:10:25 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:11:54] 10SRE, 10serviceops, 10Patch-For-Review: Migrate onhost memcached to use a unix socket - https://phabricator.wikimedia.org/T273115 (10jijiki) 05Open→03Resolved a:03jijiki [11:12:03] PROBLEM - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is CRITICAL: SSL CRITICAL - Certificate restbase1026-a valid until 2021-05-14 11:10:26 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:12:04] PROBLEM - cassandra-c SSL 10.64.16.120:7001 on restbase1023 is CRITICAL: SSL CRITICAL - Certificate restbase1023-c valid until 2021-05-14 11:10:19 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:12:11] is that for robh ^ ? 
[11:12:23] nah, they're routine expiries [11:12:28] I'll handle them [11:12:37] PROBLEM - cassandra-c SSL 10.64.16.124:7001 on restbase1024 is CRITICAL: SSL CRITICAL - Certificate restbase1024-c valid until 2021-05-14 11:10:22 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:12:56] (03CR) 10Ayounsi: [C: 03+1] "Is there a task?" [homer/public] - 10https://gerrit.wikimedia.org/r/679296 (owner: 10Elukey) [11:13:07] ok [11:13:09] PROBLEM - cassandra-a SSL 10.64.0.105:7001 on restbase1020 is CRITICAL: SSL CRITICAL - Certificate restbase1020-a valid until 2021-05-14 11:10:08 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:13:27] PROBLEM - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is CRITICAL: SSL CRITICAL - Certificate restbase1026-c valid until 2021-05-14 11:10:28 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:13:35] hnowlan: thanks, is there a way to not have routine IRC alert flood? :) [11:13:53] (03CR) 10Volans: [C: 03+1] "LGTM, as always check_http is hard to parse so I can't exclude typos, but the logic seems good." [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [11:14:15] (03CR) 10Muehlenhoff: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff) [11:14:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove dbmonitor1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff) [11:14:53] XioNoX: they expire once every 2 years so I'm tempted to say no - but this isn't ideal I agree [11:15:04] PROBLEM - cassandra-a SSL 10.64.0.101:7001 on restbase1019 is CRITICAL: SSL CRITICAL - Certificate restbase1019-a valid until 2021-05-14 11:10:05 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:15:27] how long are they in WARNING state? 
[11:15:40] maybe we can improve ways to catch that before it becomes critical [11:15:41] PROBLEM - cassandra-b SSL 10.64.16.115:7001 on restbase1022 is CRITICAL: SSL CRITICAL - Certificate restbase1022-b valid until 2021-05-14 11:10:15 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:17:04] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond) [11:17:20] PROBLEM - cassandra-c SSL 10.64.16.116:7001 on restbase1022 is CRITICAL: SSL CRITICAL - Certificate restbase1022-c valid until 2021-05-14 11:10:16 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:18:14] (03CR) 10Jbond: [V: 03+1 C: 03+2] check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [11:20:17] PROBLEM - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - Certificate restbase1027-a valid until 2021-05-14 11:10:29 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:21:16] (03Abandoned) 10Jbond: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff) [11:22:00] (03CR) 10Jbond: "noticed this came up as a merge conflict, superseded now by I584cc371938ed4c0cfd22e7e6e9d1cbefeb0df76 so boldly abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff) [11:22:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 90%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15325 and previous config saved to /var/cache/conftool/dbconfig/20210414-112211-root.json [11:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:21] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [11:25:06] (03CR) 10Muehlenhoff: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff) 
[11:25:31] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 639 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [11:25:38] !log regenerated certificates for restbase1019/restbase102[0-7] [11:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096 (s5,s6) kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15326 and previous config saved to /var/cache/conftool/dbconfig/20210414-112619-marostegui.json [11:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:02] volans: I think it's 2 months for WARNING [11:27:55] PROBLEM - cassandra-a SSL 10.64.48.126:7001 on restbase1025 is CRITICAL: SSL CRITICAL - Certificate restbase1025-a valid until 2021-05-14 11:10:23 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:28:07] PROBLEM - cassandra-a SSL 10.64.16.122:7001 on restbase1024 is CRITICAL: SSL CRITICAL - Certificate restbase1024-a valid until 2021-05-14 11:10:20 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:29:33] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [11:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:53] volans: already in https://phabricator.wikimedia.org/T225140 :) [11:30:08] XioNoX: ehehe :D [11:30:26] PROBLEM - cassandra-b SSL 10.64.48.127:7001 on restbase1025 is CRITICAL: SSL CRITICAL - Certificate restbase1025-b valid until 2021-05-14 11:10:24 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:31:38] !log Upgrade kernel on db1096 (s5, s6) [11:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] oh, hadn't seen that ticket - having tasks for these in WARNING would be great [11:32:56] part of the problem with this spam was that 
9 hosts were done at the same time in the distant past, if it was just one host it wouldn't be so spammy [11:33:53] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Migrate default nework policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (10JMeybohm) [11:33:55] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:18] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Migrate default nework policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (10JMeybohm) p:05Triage→03Low [11:34:43] PROBLEM - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - Certificate restbase1027-b valid until 2021-05-14 11:10:30 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:34:53] PROBLEM - cassandra-c SSL 10.64.0.146:7001 on restbase1020 is CRITICAL: SSL CRITICAL - Certificate restbase1020-c valid until 2021-05-14 11:10:10 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [11:35:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15327 and previous config saved to /var/cache/conftool/dbconfig/20210414-113557-root.json [11:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15328 and previous config saved to /var/cache/conftool/dbconfig/20210414-113714-root.json [11:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:28] T275633: Productionize db21[45-52] and db11[76-84] - 
https://phabricator.wikimedia.org/T275633 [11:37:51] (03CR) 10Muehlenhoff: P:debmonitor::client: migrate timer::job to use send_mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [11:38:45] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:52] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [11:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:01] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) This is not exactly looking great on the staging clusters as we can see heavy throttling. The current assumption is that this is caused by the very... [11:41:15] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [11:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15329 and previous config saved to /var/cache/conftool/dbconfig/20210414-114216-root.json [11:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:52] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10MoritzMuehlenhoff) [11:43:54] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:44:35] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Tendril and dbtree are now running on a new Buster instance dbmonitor1002.wikimedia.org with PHP 
5.6 packages from sury.org (since Tendril... [11:44:35] RECOVERY - cassandra-a SSL 10.64.0.105:7001 on restbase1020 is OK: SSL OK - Certificate restbase1020-a valid until 2023-04-14 11:20:37 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [11:45:15] RECOVERY - cassandra-b SSL 10.64.0.106:7001 on restbase1020 is OK: SSL OK - Certificate restbase1020-b valid until 2023-04-14 11:20:40 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [11:45:53] RECOVERY - cassandra-c SSL 10.64.0.146:7001 on restbase1020 is OK: SSL OK - Certificate restbase1020-c valid until 2023-04-14 11:20:42 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [11:47:51] (03PS1) 10Marostegui: site.pp: Specify the old m1 master [puppet] - 10https://gerrit.wikimedia.org/r/679317 (https://phabricator.wikimedia.org/T280121) [11:47:52] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqiad.wmnet'] ` and were **ALL** successful. [11:49:19] (03CR) 10Marostegui: [C: 03+2] site.pp: Specify the old m1 master [puppet] - 10https://gerrit.wikimedia.org/r/679317 (https://phabricator.wikimedia.org/T280121) (owner: 10Marostegui) [11:50:18] (03CR) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [11:50:36] (03CR) 10Volans: "Looks sane to me, I'd add the timeout explicitly to all requests calls, also in a different patch if you prefer." 
(033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [11:51:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15330 and previous config saved to /var/cache/conftool/dbconfig/20210414-115101-root.json [11:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:41] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 640 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [11:52:57] (03PS1) 10Marostegui: tendril.sql: Remove dbmonitor1001 grants [puppet] - 10https://gerrit.wikimedia.org/r/679318 (https://phabricator.wikimedia.org/T224589) [11:53:12] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=(wtp1034|wtp1035|wtp1036).eqiad.wmnet [11:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:21] RECOVERY - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-a valid until 2023-04-14 11:20:45 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [11:55:14] (03CR) 10Marostegui: [C: 03+2] tendril.sql: Remove dbmonitor1001 grants [puppet] - 10https://gerrit.wikimedia.org/r/679318 (https://phabricator.wikimedia.org/T224589) (owner: 10Marostegui) [11:55:19] RECOVERY - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-b valid until 2023-04-14 11:20:48 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [11:55:58] (03CR) 10Muehlenhoff: [C: 03+1] P:debmonitor::client: migrate timer::job to use send_mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [11:57:07] (03Abandoned) 10Muehlenhoff: Remove grant for dbmonitor1001 [puppet] - 10https://gerrit.wikimedia.org/r/678800 
(https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff) [11:57:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15331 and previous config saved to /var/cache/conftool/dbconfig/20210414-115720-root.json [11:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:57] RECOVERY - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-c valid until 2023-04-14 11:20:51 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:02:42] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1037.eqiad.wmnet', 'wtp1038.eqiad.wmnet', 'wtp1039.eqia... [12:03:00] !log Upgrade mysql on db1080 T279281 [12:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:10] T279281: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 [12:04:41] RECOVERY - cassandra-b SSL 10.64.16.115:7001 on restbase1022 is OK: SSL OK - Certificate restbase1022-b valid until 2023-04-14 11:20:56 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:06:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15332 and previous config saved to /var/cache/conftool/dbconfig/20210414-120604-root.json [12:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:23] RECOVERY - cassandra-c SSL 10.64.16.116:7001 on restbase1022 is OK: SSL OK - Certificate restbase1022-c valid until 2023-04-14 11:20:59 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 
[12:07:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1025 for kernel and mysql upgrade T279281', diff saved to https://phabricator.wikimedia.org/P15333 and previous config saved to /var/cache/conftool/dbconfig/20210414-120724-marostegui.json [12:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:51] (03PS1) 10Jgreen: rename frmon*.frdev to just frmon* keeping a transitional legacy CNAME [dns] - 10https://gerrit.wikimedia.org/r/679319 (https://phabricator.wikimedia.org/T280034) [12:10:53] (03CR) 10Jgreen: [C: 03+2] rename frmon*.frdev to just frmon* keeping a transitional legacy CNAME [dns] - 10https://gerrit.wikimedia.org/r/679319 (https://phabricator.wikimedia.org/T280034) (owner: 10Jgreen) [12:11:43] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add lmeintrup [puppet] - 10https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531) (owner: 10Filippo Giunchedi) [12:12:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15334 and previous config saved to /var/cache/conftool/dbconfig/20210414-121223-root.json [12:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:33] RECOVERY - cassandra-a SSL 10.64.16.118:7001 on restbase1023 is OK: SSL OK - Certificate restbase1023-a valid until 2023-04-14 11:21:01 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:14:51] RECOVERY - cassandra-b SSL 10.64.16.119:7001 on restbase1023 is OK: SSL OK - Certificate restbase1023-b valid until 2023-04-14 11:21:04 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:15:31] RECOVERY - cassandra-a SSL 10.64.16.114:7001 on restbase1022 is OK: SSL OK - Certificate restbase1022-a valid until 2023-04-14 11:20:53 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:17:41] RECOVERY - cassandra-c SSL 
10.64.16.120:7001 on restbase1023 is OK: SSL OK - Certificate restbase1023-c valid until 2023-04-14 11:21:07 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:21:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15335 and previous config saved to /var/cache/conftool/dbconfig/20210414-122108-root.json [12:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:59] (03PS1) 10Gehel: WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) [12:24:52] RECOVERY - cassandra-b SSL 10.64.16.123:7001 on restbase1024 is OK: SSL OK - Certificate restbase1024-b valid until 2023-04-14 11:21:12 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:26:03] (03CR) 10jerkins-bot: [V: 04-1] WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel) [12:26:42] RECOVERY - cassandra-c SSL 10.64.16.124:7001 on restbase1024 is OK: SSL OK - Certificate restbase1024-c valid until 2023-04-14 11:21:14 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:26:56] (03PS2) 10Gehel: WDQS: Wait for updater to catchup during data transfer. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) [12:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15336 and previous config saved to /var/cache/conftool/dbconfig/20210414-122727-root.json [12:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:50] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:15] (03CR) 10Filippo Giunchedi: [C: 03+2] "Chatted on IRC, for wmf ldap membership only we're ok to go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) (owner: 10Filippo Giunchedi) [12:28:24] (03PS2) 10Filippo Giunchedi: admin: add hnordeen [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) [12:29:45] (03CR) 10jerkins-bot: [V: 04-1] WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel) [12:30:42] (03PS3) 10Gehel: WDQS: Wait for updater to catchup during data transfer. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) [12:30:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1037.eqiad.wmnet with reason: REIMAGE [12:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:38] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1038.eqiad.wmnet with reason: REIMAGE [12:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1037.eqiad.wmnet with reason: REIMAGE [12:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:52] RECOVERY - cassandra-a SSL 10.64.48.126:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-a valid until 2023-04-14 11:21:17 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:33:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15337 and previous config saved to /var/cache/conftool/dbconfig/20210414-123357-root.json [12:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1038.eqiad.wmnet with reason: REIMAGE [12:34:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1039.eqiad.wmnet with reason: REIMAGE [12:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:33] (03PS1) 10Patriccck: Czech Wikimedia / 
Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) [12:36:44] (03PS2) 10Jbond: debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 [12:36:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1039.eqiad.wmnet with reason: REIMAGE [12:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:34] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi User added to `wmf` group (chatted on IRC with @jbond), @HNordeenWMF you should have access now! [12:38:22] (03CR) 10Elukey: [C: 03+2] "I think that Keith is working on the new cluster, will follow up later on :)" [homer/public] - 10https://gerrit.wikimedia.org/r/679296 (owner: 10Elukey) [12:39:13] !log update kafka term for analytics-in{4,6} on cr{1,2}-eqiad to include kafka-logging1001 - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679296 [12:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:40] RECOVERY - cassandra-b SSL 10.64.48.127:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-b valid until 2023-04-14 11:21:19 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:41:19] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10fgiunchedi) 05Open→03Resolved @Lena_WMDE you are now in `nda` and `wmde` groups, please verify access and reopen the task if something is amiss! 
[12:42:27] (03CR) 10Zabe: [C: 04-1] Czech Wikimedia / Powered by MediaWiki icons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck) [12:42:29] (03PS2) 10Patriccck: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) [12:42:56] RECOVERY - cassandra-c SSL 10.64.48.128:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-c valid until 2023-04-14 11:21:22 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:43:57] (03PS3) 10Patriccck: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) [12:44:40] PROBLEM - MegaRAID on an-worker1100 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:44:41] ACKNOWLEDGEMENT - MegaRAID on an-worker1100 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T280132 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:44:44] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10ops-monitoring-bot) [12:47:32] RECOVERY - cassandra-a SSL 10.64.16.122:7001 on restbase1024 is OK: SSL OK - Certificate restbase1024-a valid until 2023-04-14 11:21:09 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:49:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15338 and previous config saved to /var/cache/conftool/dbconfig/20210414-124901-root.json [12:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:08] RECOVERY - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is OK: SSL OK - Certificate 
restbase1026-a valid until 2023-04-14 11:21:25 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:51:39] (03PS4) 10Urbanecm: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck) [12:51:51] (03PS1) 10Seddon: Change HTTP to HTTPS for concept URIs on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590) [12:52:29] (03CR) 10jerkins-bot: [V: 04-1] Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck) [12:53:26] (03PS5) 10Urbanecm: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck) [12:53:39] (03PS1) 10Jbond: C:aptrepo: add gitlab repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545) [12:53:42] RECOVERY - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-a valid until 2023-04-14 11:21:33 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:54:52] RECOVERY - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-b valid until 2023-04-14 11:21:35 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:54:54] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:09] (03PS6) 10Urbanecm: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck) [12:57:16] RECOVERY - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-c 
valid until 2023-04-14 11:21:38 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:57:45] (03CR) 10Kosta Harlan: "> Patch Set 1: Code-Review+2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris) [12:57:52] RECOVERY - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-b valid until 2023-04-14 11:21:27 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:59:56] RECOVERY - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-c valid until 2023-04-14 11:21:30 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [13:01:08] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:38] !log extend prometheus global @ codfw by 100G [13:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [13:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:58] PROBLEM - Device not healthy -SMART- on an-worker1100 is CRITICAL: cluster=analytics device=sat+megaraid,10 instance=an-worker1100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1100&var-datasource=eqiad+prometheus/ops [13:02:05] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [13:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29027/console" [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545) (owner: 10Jbond) [13:04:05] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15339 and previous config saved to /var/cache/conftool/dbconfig/20210414-130404-root.json [13:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:18] RECOVERY - cassandra-a SSL 10.64.0.101:7001 on restbase1019 is OK: SSL OK - Certificate restbase1019-a valid until 2023-04-14 11:20:29 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [13:11:30] (03CR) 10Ottomata: [C: 03+1] remove obsolete html files from snapshot manifests for dumps [puppet] - 10https://gerrit.wikimedia.org/r/678719 (https://phabricator.wikimedia.org/T279661) (owner: 10ArielGlenn) [13:12:07] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [13:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:49] !log installing OpenSSL updates on buster [13:12:49] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10Ottomata) Should be fine, it'd be nice if this ticket had a little more info about who and why though! 
[13:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:19] (03CR) 10David Caro: [C: 03+2] "The compilation results differences are expected: https://puppet-compiler.wmflabs.org/compiler1001/718/" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [13:15:56] (03CR) 10David Caro: [C: 03+2] ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [13:17:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:19:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15340 and previous config saved to /var/cache/conftool/dbconfig/20210414-131908-root.json [13:19:12] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1038.eqiad.wmnet', 'wtp1037.eqiad.wmnet', 'wtp1039.eqiad.wmnet'] ` and were **ALL** successful. [13:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:21] (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris) [13:19:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:26:42] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:05] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@825c60a]: T273847 export queries to relforge dag deployment - schedule change [13:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:14] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [13:28:04] (03PS1) 10Muehlenhoff: Remove obsolete backup of /root on the apt servers [puppet] - 10https://gerrit.wikimedia.org/r/679332 [13:29:02] (03CR) 10Volans: [C: 03+1] "tested and LGTM, see comment inline on py2 support" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [13:29:14] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@825c60a]: T273847 export queries to relforge dag deployment - schedule change (duration: 02m 08s) [13:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:47] (03PS1) 10Jbond: cfssl::multirootca: install certs script [puppet] - 10https://gerrit.wikimedia.org/r/679336 [13:34:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15341 and previous config saved to /var/cache/conftool/dbconfig/20210414-133411-root.json [13:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:10] (03CR) 10Jbond: [C: 03+2] cfssl::multirootca: install certs script [puppet] - 10https://gerrit.wikimedia.org/r/679336 (owner: 10Jbond) [13:38:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff) [13:39:22] (03PS3) 10Jbond: debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 [13:39:32] (03CR) 10Jbond: debmonitor-client: Improve retry logic (036 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [13:43:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545) (owner: 10Jbond) [13:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es5 master', diff saved to https://phabricator.wikimedia.org/P15342 and previous config saved to /var/cache/conftool/dbconfig/20210414-134331-marostegui.json [13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:aptrepo: add gitlab repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545) (owner: 10Jbond) [13:46:50] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [13:48:17] !log disabling puppet on C:mcrouter for cert renewal [13:48:22] PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:49] 10SRE, 10serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (10RLazarus) a:03RLazarus [13:53:41] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/679341 [14:01:50] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:04] (03PS1) 
10Jbond: reprepo: add gitlab component [puppet] - 10https://gerrit.wikimedia.org/r/679345 [14:06:26] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [14:08:12] 10SRE, 10Maps, 10Packaging, 10Product-Infrastructure-Team-Backlog, 10serviceops: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (10MSantos) [14:08:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679345 (owner: 10Jbond) [14:08:34] PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:40] (03CR) 10Jbond: [C: 03+2] reprepo: add gitlab component [puppet] - 10https://gerrit.wikimedia.org/r/679345 (owner: 10Jbond) [14:09:06] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@8ae53e3]: T273847 export queries to relforge dag deployment - start date update [14:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:16] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [14:11:04] !log installing intel-microcode updates on Buster [14:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:20] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@8ae53e3]: T273847 export queries to relforge dag deployment - start date update (duration: 02m 14s) [14:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:39] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Use the main_app resources for loaddatasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/679347 [14:13:07] !log mcrouter cert renewal complete, puppet re-enabled T276029 [14:13:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:15] T276029: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 [14:14:00] 10SRE, 10serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (10RLazarus) 05Open→03Resolved Done -- just re-enabled puppet, so they'll get picked up over the next 30m. [14:18:08] (03CR) 10Jcrespo: [C: 03+1] "I checked and found no other uses of the fileset." [puppet] - 10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff) [14:22:16] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10HNordeenWMF) Thank you @Ottomata @jbond and @fgiunchedi ! Sorry for the lack of context -- I'm on the online fundraising team, and would like access to Turnilo for monitoring impressions on our A/B b... [14:31:02] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:47] (03PS1) 10Ladsgroup: Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) [14:34:52] (03CR) 10Hoo man: Disable legacy javascript global variables in ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [14:35:22] (03CR) 10Arturo Borrero Gonzalez: "do you happen to know why we need this in the first place? It would be better if we could drop entirely that exception." [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff) [14:35:49] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) ` joe@wotan:~/Sandbox/mw-on-k8s$ kubectl get pods NAME READY STATUS RESTARTS AGE mediawiki-test-6fb67b5f8b-... 
[14:38:32] (03CR) 10Ladsgroup: Disable legacy javascript global variables in ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [14:39:46] (03PS1) 10Ayounsi: Merge all system.conf templates in one [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) [14:40:12] (03CR) 10Hoo man: Disable legacy javascript global variables in ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [14:43:22] (03PS2) 10Ladsgroup: Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) [14:45:03] (03PS2) 10CDanis: Add a public_cloud bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/679341 (https://phabricator.wikimedia.org/T279380) [14:48:12] PROBLEM - Check systemd state on debmonitor2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_nginx.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:12] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission bast1002.wikimedia.org - https://phabricator.wikimedia.org/T280110 (10wiki_willy) a:03Cmjohnson [14:48:22] !log O:logstash::elasticsearch7 update elasticsearch-curator to 5.8.1 [14:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:12] PROBLEM - mediawiki-installation DSH group on wtp1039 is CRITICAL: Host wtp1039 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:49:18] PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:20] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete backup of /root on the apt 
servers [puppet] - 10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff) [14:54:42] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:17] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10wiki_willy) a:03Cmjohnson [14:55:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10wiki_willy) a:05wiki_willy→03Cmjohnson [14:56:54] (03CR) 10Jbond: [C: 03+2] debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond) [15:00:02] PROBLEM - mediawiki-installation DSH group on wtp1037 is CRITICAL: Host wtp1037 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:00:44] !log run new curator actions on codfw - T274394 [15:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:53] T274394: ES Curator cron jobs are not cleaned up when output no longer exists - https://phabricator.wikimedia.org/T274394 [15:01:06] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:38] (03PS1) 10Elukey: aptrepo: add component libmysql-java to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/679356 (https://phabricator.wikimedia.org/T278424) [15:08:40] (03CR) 10Ema: [C: 03+1] Move hue.wikimedia.org to the an-tool1009 backend [puppet] - 10https://gerrit.wikimedia.org/r/678861 (https://phabricator.wikimedia.org/T264896) (owner: 10Elukey) [15:08:49] (03PS1) 10Ppchelko: Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) [15:09:10] PROBLEM - 
mediawiki-installation DSH group on wtp1038 is CRITICAL: Host wtp1038 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:10:05] (03PS1) 10Jbond: Drop python3 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 [15:10:47] (03PS1) 10Filippo Giunchedi: pontoon: use rolemap for template stack [puppet] - 10https://gerrit.wikimedia.org/r/679359 (https://phabricator.wikimedia.org/T280083) [15:12:59] (03CR) 10Cwhite: [C: 03+1] pontoon: use rolemap for template stack [puppet] - 10https://gerrit.wikimedia.org/r/679359 (https://phabricator.wikimedia.org/T280083) (owner: 10Filippo Giunchedi) [15:13:40] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use rolemap for template stack [puppet] - 10https://gerrit.wikimedia.org/r/679359 (https://phabricator.wikimedia.org/T280083) (owner: 10Filippo Giunchedi) [15:14:57] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679356 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [15:15:13] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete backup of /root on the apt servers [puppet] - 10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff) [15:15:23] (03PS2) 10Muehlenhoff: Remove obsolete backup of /root on the apt servers [puppet] - 10https://gerrit.wikimedia.org/r/679332 [15:15:27] (03CR) 10Volans: [C: 03+1] "Thx, lgtm, we can probably add 3.8/9 (or just 3.9) in a later patch" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond) [15:16:06] (03CR) 10Elukey: [C: 03+2] aptrepo: add component libmysql-java to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/679356 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [15:18:40] (03PS2) 10Jbond: Drop python3 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 [15:20:27] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 
(owner: 10Jbond) [15:21:06] (03PS3) 10Jbond: Drop python3 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 [15:25:22] RECOVERY - AQS root url on aqs1010 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:25:32] RECOVERY - AQS root url on aqs1011 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:26:22] RECOVERY - AQS root url on aqs1012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:27:24] (03PS1) 10Giuseppe Lavagetto: mediawiki/httpd: adapt to kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/679362 [15:28:29] new nodes --^ [15:29:10] RECOVERY - AQS root url on aqs1014 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:31:06] RECOVERY - AQS root url on aqs1015 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:31:46] (03PS11) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [15:32:55] 10SRE, 10Traffic, 10Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema) Apparently we do [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=97&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from... 
[15:32:57] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff) [15:33:54] (03PS1) 10Ema: cache_upload: set nuke_limit to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/679364 (https://phabricator.wikimedia.org/T275809) [15:33:58] (03PS12) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [15:34:05] (03CR) 10Giuseppe Lavagetto: Helm chart to run MediaWiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [15:35:04] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/679364 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [15:36:12] (03PS4) 10Arturo Borrero Gonzalez: gridengine: set grid-configurator source files to use new domain name [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [15:36:14] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) [15:37:50] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [15:38:27] (03PS1) 10Volans: netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 [15:39:10] (03PS1) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) [15:40:01] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) 
[15:40:46] (03CR) 10Volans: "FYI" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel) [15:40:47] 10SRE, 10netops: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10jbond) proposal seems fine to me however it would put it theses routes above PEER_INTERNAL which is probably fine but feels wrong ~~That said Im also curious why PEERING_ROUTE and PEERING_ROUTE_PRIMARY hav... [15:43:06] (03PS7) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [15:45:59] (03CR) 10Klausman: [C: 03+1] admin: Introduce the cluster_group concept [deployment-charts] - 10https://gerrit.wikimedia.org/r/678789 (owner: 10Alexandros Kosiaris) [15:48:13] (03CR) 10CRusnov: "This change is ready for review." [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [15:48:38] (03PS8) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [15:50:16] (03CR) 10CRusnov: "LGTM thank you for this" [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans) [15:50:25] (03CR) 10CRusnov: [C: 03+1] netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans) [15:51:01] (03PS2) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) [15:53:17] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) @crusnov if you have time let's do it this week or the next! 
[15:55:09] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10crusnov) >>! In T271136#6999003, @elukey wrote: > @crusnov if you have time let's do it this week or the next! Yes (thank you for the ping), let's do it first th... [15:56:44] (03PS3) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) [15:57:34] (03PS4) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) [16:00:23] (03PS1) 10Ottomata: refine - lowercase eventlogging legeacy table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789) [16:03:21] (03CR) 10Muehlenhoff: bigtop::mysql_jdbc: use component/libmysql-java for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [16:04:08] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:53] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff) [16:06:00] (03PS5) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) [16:07:27] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29039/console" [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [16:09:10] (03CR) 10Physikerwelt: [C: 03+1] Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [16:10:00] (03CR) 10Elukey: [V: 03+1] bigtop::mysql_jdbc: use component/libmysql-java for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [16:10:09] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [16:10:36] (03CR) 10Muehlenhoff: bigtop::mysql_jdbc: use component/libmysql-java for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [16:12:44] (03CR) 10BryanDavis: "> Is this perhaps used by stashbot or other IRC-integrating bot? cc'ing Bryan." [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff) [16:13:35] (03CR) 10Reedy: "Why has this suddenly become a thing? What does this add other than more complexity?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck) [16:13:59] (03CR) 10Patriccck: Czech Wikimedia / Powered by MediaWiki icons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck) [16:20:00] (03PS1) 10Jbond: base::firewall: ass switch to use seperate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) [16:20:37] (03CR) 10Elukey: [V: 03+1] "The code is not great but we'll likely do a clean up when stretch is gone, so it should be ok for the moment, but lemme know :)" [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [16:21:00] (03CR) 10Ottomata: [C: 03+1] bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [16:21:17] (03CR) 
10jerkins-bot: [V: 04-1] base::firewall: ass switch to use seperate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:21:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29040/console" [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:21:58] jbond42: is it "add" ? :D [16:22:07] in the commit message :D [16:22:09] :D lol yes just noticed that :D [16:22:43] (03PS2) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) [16:23:46] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:29:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [16:29:54] (03PS1) 10Awight: Temporarily disable some reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/679390 (https://phabricator.wikimedia.org/T279046) [16:31:06] (03PS3) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) [16:32:45] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:32:59] (03PS4) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) [16:34:11] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 
(https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:37:51] (03PS5) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) [16:40:04] (03PS1) 10Ahmon Dancy: MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391 [16:40:38] (03PS1) 10Jbond: hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) [16:41:02] (03PS2) 10Ahmon Dancy: MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391 (https://phabricator.wikimedia.org/T263872) [16:41:24] (03CR) 10jerkins-bot: [V: 04-1] hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:42:13] (03PS2) 10Jbond: hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) [16:42:52] (03CR) 10jerkins-bot: [V: 04-1] hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:45:20] (03CR) 10Ahmon Dancy: [C: 03+2] MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy) [16:45:26] (03PS3) 10Jbond: hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) [16:48:35] (03Merged) 10jenkins-bot: MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon 
Dancy) [16:53:52] PROBLEM - Check systemd state on debmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_nginx.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:13] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29042/sretest1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond) [16:55:30] PROBLEM - Check systemd state on ldap-replica1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:45] jbond42: ^^ looking [16:56:24] same, 502 proxy error [16:57:26] and a restart worked just fine [16:57:54] RECOVERY - Check systemd state on ldap-replica1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:26] strange that the switch to apache ios causing this issue. when i checked the one for gerrit there was an error getting data from the backend [16:58:42] i.e. 
apache -> uwsgi [17:00:07] ill see (tomorrow) if there are some settings i can put on the proxy config to improve the reliability [17:03:22] RECOVERY - Check systemd state on debmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:40] RECOVERY - Check systemd state on debmonitor2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:32] jbond42: ack, thx, lmk if you want another pair of eyes, I'm not looking at it right now [17:12:32] (03PS1) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679399 [17:12:45] thanks volans ^^^ seems like a good first step but im logging off for now, will pick it back up tomorrow [17:13:53] (03CR) 10Volans: [C: 03+1] "Why not, let's test this one." [puppet] - 10https://gerrit.wikimedia.org/r/679399 (owner: 10Jbond) [17:27:38] (03CR) 10Ryan Kemper: [C: 03+1] "LGTM. We can make volans' proposed change once the required patch is merged." [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel) [17:33:10] (03PS1) 10Urbanecm: DatabaseMentorStore: Cache mentor in memcached [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679003 (https://phabricator.wikimedia.org/T279959) [17:33:53] 10SRE, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Volans) @LSobanski I'll try to give you some context from the SRE I/F team side of... [17:34:50] (03CR) 10Volans: [C: 03+2] netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans) [17:39:04] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/679403 (owner: 10CRusnov) [17:39:06] thcipriani: hello, what happened with wmf.2 at T280157 please? 🙂 [17:39:06] T280157: 1.37.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T280157 [17:39:23] (03CR) 10Urbanecm: [C: 03+2] "backporting" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679003 (https://phabricator.wikimedia.org/T279959) (owner: 10Urbanecm) [17:39:37] (03CR) 10Urbanecm: [C: 03+2] "in preparation for B&C window" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679002 (https://phabricator.wikimedia.org/T279957) (owner: 10Urbanecm) [17:42:27] 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10wiki_willy) a:03Cmjohnson [17:42:28] Urbanecm: something got strange in the re-numbering following the branch cut, so I ended up having to make a wmf.2 so that the thing that generates the calendars would be happy :\ tl;dr: yak shaving [17:43:42] thcipriani: I see. So, there won't be a train next week? [17:43:46] or is it just a number skipped? [17:44:03] I'm asking because I want to know when my risky patch will be deployed [17:44:13] 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10wiki_willy) Hi @Cmjohnson - can you add the Airport Express access point into Netbox? It should be next to (or near) the management router. Thanks, Willy [17:44:52] (03CR) 10jerkins-bot: [V: 04-1] netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans) [17:44:56] Urbanecm: there will not be a train next week. Sent the email to wikitech-l this morning. Date was on https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar (although I missed that until this morning as well). Your patch will go out the week after if it's in the mainline development branch now. 
[17:46:33] got it, thanks a lot [17:50:29] (03CR) 10Volans: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans) [17:56:36] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/GrowthExperiments/: ce44792: 84107c5: GrowthExperiments backports related to DatabaseMentorStore (T279957; T279959) (duration: 01m 55s) [17:56:39] * Urbanecm done [17:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:49] T279957: DatabaseMentorStore::setMentorForUser needs to be safe to call on GET requests - https://phabricator.wikimedia.org/T279957 [17:56:49] T279959: Cache mentor/mentee relationship in memcached - https://phabricator.wikimedia.org/T279959 [17:58:50] (03Merged) 10jenkins-bot: netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for [[Backport windows|Morning backport window]]
. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:04] longma and marxarelli: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1800). [18:02:16] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 31089112 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:04:36] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:16:29] (03CR) 10Volans: [C: 03+1] "Looks good." (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/679403 (owner: 10CRusnov) [18:23:51] (03PS1) 10Herron: kafka-logging: migrate broker logstash1011 to kafka-logging1002 [puppet] - 10https://gerrit.wikimedia.org/r/679411 (https://phabricator.wikimedia.org/T279342) [18:30:03] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/29045/" [puppet] - 10https://gerrit.wikimedia.org/r/679411 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [18:34:38] (03CR) 10Dzahn: [C: 03+1] "yes, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531) (owner: 10Filippo Giunchedi) [18:39:48] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) >>! In T279100#6997273, @akosiaris wrote: > We seem to only have 1 dedicated videoscaler in c... 
[18:41:14] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) Alex, your patch looks good but i can also see Effie's point. hmm... [18:43:51] (03PS1) 10CDanis: Add es_exporter config for NEL events [puppet] - 10https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527) [18:49:28] 10SRE, 10RESTBase, 10Traffic, 10Page-Previews (Tracking), and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534 (10Jdlrobson) [18:51:04] (03CR) 10CDanis: "I *think* this is correct, but haven't actually written one of these before -- please let me know :)" [puppet] - 10https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [18:55:52] PROBLEM - Disk space on urldownloader1002 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=87%): /tmp 340 MB (3% inode=87%): /var/tmp 340 MB (3% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader1002&var-datasource=eqiad+prometheus/ops [18:56:28] (03CR) 10Cwhite: [C: 03+1] "LGTM! I ran the query and the output looks good." [puppet] - 10https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [18:57:57] 10SRE, 10Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10TJones) I've moved this ticket back to "needs triage" so we can discuss it again in light of the recent problems with T274200, and decide if we should make it more of a priority, an... [18:58:38] !log urldownloader1002 - icinga alerted about disk space, ran 'apt-get clean' which is my usual go to in that case. 
it reduced usage from 97% to 89% [18:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] longma and marxarelli: (Dis)respected human, time to deploy Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1900). Please do the needful. [19:00:42] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Legoktm) >>! In T279100#6997472, @akosiaris wrote: >>>! In T279100#6997312, @jijiki wrote: >> I thin... [19:01:46] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) I like Lego's summary. [19:02:19] (03PS1) 10Jeena Huneidi: group1 wikis to 1.37.0-wmf.1 refs T278345 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679421 [19:02:21] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.37.0-wmf.1 refs T278345 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679421 (owner: 10Jeena Huneidi) [19:03:10] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.1 refs T278345 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679421 (owner: 10Jeena Huneidi) [19:04:09] (03CR) 10CDanis: [C: 03+2] Add es_exporter config for NEL events [puppet] - 10https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [19:04:41] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.1 refs T278345 [19:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:49] T278345: 1.37.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T278345 [19:06:45] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.1 refs T278345 (duration: 02m 03s) [19:06:59] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:13] 10SRE, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10Dzahn) [19:08:22] longma: o/ hey hey [19:08:26] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10Dzahn) [19:08:34] * marxarelli is log watching [19:09:15] marxarelli: do you know anything about errors depooling the servers when running train? [19:09:38] i don't [19:09:47] did scap throw an error? [19:09:58] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10Dzahn) [19:10:03] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10Dzahn) Hi @HMonroy, slightly renamed the ticket, confirmed you already signed L3 and added releng for deployment approval. others will continue with this soon,... [19:10:55] well it said success at the end, but it shows some errors depooling some servers, I think because they are disabled instead of enabled. So maybe it's fine? [19:11:49] hello! which servers do you see there having issues? [19:11:52] wtp* ? [19:12:14] or mw* [19:12:57] jobrunner_443 and videoscaler_443 [19:13:31] ah, do you see any "mw" host name in there? 
[19:13:57] might be the special ones we defined as "jobrunner only but not videoscaler" [19:14:14] because they are depooled from videoscaler pool [19:14:40] yeah it also says not restarting php7.2-fpm 100 on mw1338 [19:14:45] mw1335 and mw1336 [19:14:55] mw1337 and mw1338 [19:15:15] mw1334.eqiad.wmnet: [apache2,nginx] [19:15:16] mw1335.eqiad.wmnet: [apache2,nginx] # Only pooled as videoscaler [19:15:18] mw1336.eqiad.wmnet: [apache2,nginx] # Only pooled as videoscaler [19:15:21] mw1337.eqiad.wmnet: [apache2,nginx] # Only pooled as jobrunner [19:15:24] mw1338.eqiad.wmnet: [apache2,nginx] # Only pooled as jobrunner [19:15:26] it's this ^ [19:15:55] if it doesn't actually break anything for you then it's just noise, but we can still think about how to remove that [19:16:21] so I assume the "1 hosts had failures restarting php-fpm" is mw1338, but it said it wasn't restarting because of "free opcache 362MB Fragmentation is at 33%, nothing to do here" [19:16:28] okay [19:16:37] ah, yea, that last part seems fine [19:16:56] alright, thanks for the help! [19:17:14] RECOVERY - Disk space on urldownloader1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader1002&var-datasource=eqiad+prometheus/ops [19:17:18] yw [19:17:51] probably it should not call it a 'failure' if it just had nothing to do, ack [19:18:46] that is odd. i wonder is that systemctl that exited non-zero? 
[19:19:07] yeah, it was unclear to me since the "failure" message at the end didn't say which host [19:19:34] wmf.1 logs look fairly clean otherwise, just the usual lock wait timeouts [19:19:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:22:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:25:19] marxarelli: it's this script https://gerrit.wikimedia.org/r/c/operations/puppet/+/657398/3/modules/profile/files/mediawiki/php/php-check-and-restart.sh [19:25:29] some places there have an explicit "exit 0" [19:25:29] was just looking at that [19:25:36] but the "nothing to do" one does not [19:25:39] right, but not there [19:25:43] yea [19:25:55] looks like _joe_ or effie might know more [19:26:03] indeed [19:27:17] I will mention it but if you want to comment as well,this is for https://phabricator.wikimedia.org/T279100 [19:28:03] well, kind of :) [19:28:36] ack. thanks, mutante :) [19:28:58] it popped up because of this special case but also it's a general question about the restart script that scap runs all the time but usually wasnt a problem [19:29:37] longma: if you still have the scap output, it might help to post it ^ [19:29:41] in that task that is [19:29:50] yeah, should I put it on the linked task? 
[19:30:01] seems like a decent spot to me [19:30:09] yes, that would be helpful [19:30:17] after I figure out how to copy in tmux 😂 [19:30:41] haha [19:30:44] if in doubt, screenshot it :p [19:30:51] drag into phab comment [19:30:52] (03PS1) 10Herron: kafka-logging1002: disable notifications during setup [puppet] - 10https://gerrit.wikimedia.org/r/679424 [19:31:07] i often flail in tmux [19:31:35] (03CR) 10Herron: [C: 03+2] kafka-logging1002: disable notifications during setup [puppet] - 10https://gerrit.wikimedia.org/r/679424 (owner: 10Herron) [19:32:57] (03PS1) 10Herron: Revert "kafka-logging1001: disable icinga notifications during setup" [puppet] - 10https://gerrit.wikimedia.org/r/679446 [19:35:58] (03CR) 10Herron: [C: 03+2] Revert "kafka-logging1001: disable icinga notifications during setup" [puppet] - 10https://gerrit.wikimedia.org/r/679446 (owner: 10Herron) [19:40:31] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jeena) Some "errors" restarting php-fpm and depooling services popped up while running the train tod... 
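Editorial sketch of the exit-status point raised above: if a check-and-restart step has a "nothing to do" branch that does not report success explicitly, a caller that treats any non-zero status as a failure may count a no-op as a failed restart. The Python model below is only an illustration under stated assumptions; the function names and thresholds are hypothetical and are not taken from php-check-and-restart.sh or scap, and the log does not confirm that this was the actual cause.

```python
from typing import Dict, List


def restart_service() -> bool:
    """Stand-in for a real service restart; always succeeds in this sketch."""
    return True


def check_and_restart(free_opcache_mb: int, fragmentation_pct: int,
                      min_free_mb: int = 100,
                      max_fragmentation_pct: int = 50) -> int:
    """Return a process-style exit code: 0 for success or no-op, 1 on failure.

    The thresholds are hypothetical defaults chosen for illustration.
    """
    needs_restart = (free_opcache_mb < min_free_mb
                     or fragmentation_pct > max_fragmentation_pct)
    if not needs_restart:
        # Nothing to do: report success explicitly instead of falling through.
        return 0
    return 0 if restart_service() else 1


def failed_hosts(results: Dict[str, int]) -> List[str]:
    """List hosts with a non-zero exit code so a summary can name them."""
    return sorted(host for host, code in results.items() if code != 0)
```

Naming the failing hosts in the summary, as `failed_hosts` does, would also address the complaint that the final "failure" message did not say which host it referred to.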
[19:42:04] !log migrating kafka-logging broker logstash1011 to kafka-logging1002 T279342 [19:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:13] T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts - https://phabricator.wikimedia.org/T279342 [19:45:21] (03CR) 10Herron: [C: 03+2] kafka-logging: migrate broker logstash1011 to kafka-logging1002 [puppet] - 10https://gerrit.wikimedia.org/r/679411 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [19:47:12] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2411.codfw.wmnet,cluster=jobrunner [19:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:59] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2394.codfw.wmnet,cluster=videoscaler [19:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:10] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2395.codfw.wmnet,cluster=videoscaler [19:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:00] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2410.codfw.wmnet,cluster=videoscaler [19:50:05] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2411.codfw.wmnet,cluster=videoscaler [19:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:27] !log dzahn@cumin1001 conftool action : set/weight=20; selector: name=mw2411.codfw.wmnet,cluster=videoscaler [19:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:33] !log dzahn@cumin1001 conftool action : set/weight=20; selector: name=mw2410.codfw.wmnet,cluster=videoscaler [19:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:28] !log dzahn@cumin1001 
conftool action : set/weight=20; selector: name=mw2394.codfw.wmnet,cluster=jobrunner [19:52:33] !log dzahn@cumin1001 conftool action : set/weight=20; selector: name=mw2395.codfw.wmnet,cluster=jobrunner [19:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:21] (03PS1) 10Dzahn: conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) [20:00:04] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]] . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T2000). [20:00:13] (03CR) 10jerkins-bot: [V: 04-1] conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [20:01:27] (03PS2) 10Dzahn: conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) [20:02:10] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/679432" [puppet] - 10https://gerrit.wikimedia.org/r/679258 (https://phabricator.wikimedia.org/T279100) (owner: 10Alexandros Kosiaris) [20:02:12] (03CR) 10jerkins-bot: [V: 04-1] conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [20:02:55] (03PS3) 10Dzahn: conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) [20:08:09] longma: thanks! looks like your screenshot is pointing out 2 separate issues (the WARNING vs the ERROR parts basically) [20:09:54] Yes! 
I thought so too [20:11:37] *nod* should both be ignorable for deployment for now, but should have follow-ups [20:12:15] (since it's just the 2 special hosts) [20:12:27] thanks for looking into it :) [20:15:18] PROBLEM - Ensure local MW versions match expected deployment on wtp1037 is CRITICAL: CRITICAL: 524 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:15:38] PROBLEM - Ensure local MW versions match expected deployment on wtp1038 is CRITICAL: CRITICAL: 524 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:17:26] PROBLEM - Ensure local MW versions match expected deployment on wtp1039 is CRITICAL: CRITICAL: 524 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:24:23] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/679447 (owner: 10Paladox) [20:25:36] mutante: could the version mismatches above be from the train? or some other reason? [20:25:38] ^ shouldn't that be looked at? [20:30:48] longma: won't scap pull fix it? I think some of the wtp* servers had work today [20:30:57] https://phabricator.wikimedia.org/T268524 [20:31:13] I can try it [20:31:45] longma: I'm sure I've seen it mentioned before with reimaged servers [20:32:03] Please be careful though [20:32:08] As that's just my memory [20:32:13] haha [20:32:46] 💀 [20:32:58] could something bad happen if I do scap pull? [20:33:58] Maybe see if the servers are pooled first [20:35:19] eh, back, seeing this now [20:35:22] longma: `scap pull` should be fully safe. It will just fetch the deploy server state to the server you run it on [20:35:27] that is yet another unrelated issue [20:35:34] because wtp servers are being reimaged [20:35:38] ooh. 
[20:35:40] scap pull should fix it, yes [20:36:55] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Aklapper) Hi and welcome, please see https://phabricator.wikimedia.org/tag/ldap-access-requests/ for required data and (for future reference) for a template link. Thanks! [20:37:11] you dont need to worry. these 3 servers are not getting any traffic [20:37:17] ah okay [20:37:22] but we should still avoid the alerts [20:38:10] !log wtp1037, wtp1038, wtp1039 - scap pull [20:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:52] longma: scap pull should never hurt [20:39:03] would scap sync-world also work? [20:39:12] RECOVERY - Ensure local MW versions match expected deployment on wtp1039 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:39:12] I think so, just takes longer [20:39:16] *as long as the state on the deploy server is not being actively changed [20:39:30] I was about to reschedule the icinga checks but there it goes ^ [20:39:36] PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 890.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:40:28] RECOVERY - Ensure local MW versions match expected deployment on wtp1038 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:40:29] since I couldn't find fingerprints to confirm when I tried to log onto wtp1037 [20:40:33] while the servers are being reimaged they are in pooled=inactive state. 
this means not being in scap "dsh" groups, so not getting deploys [20:40:40] issue is that they still alert [20:40:47] or that scap pull wanst run manually [20:41:16] it only becomes a real problem if we never scap pull before repooling them [20:41:49] longma: yea, fingerprint also changed because of reimaging, that is ongoing "upgrade to buster" [20:42:12] only parsoid (wtp) [20:42:32] ah okay. That was the reason I wanted to run sync world instead :P [20:45:54] RECOVERY - Ensure local MW versions match expected deployment on wtp1037 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:47:21] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) After mw train was deployed we get some Icinga alerts which caused worry among deployers: ` 20:15 <+icinga-wm> PROBLEM - Ensure local MW versions match ex... [20:47:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:47:30] I left some comments about this on a ticket as well ^ [20:49:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:50:24] Ty mutante longma [20:52:52] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp1037 is CRITICAL: Host wtp1037 is not in mediawiki-installation dsh group daniel_zahn T268524 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:52:52] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp1038 is CRITICAL: Host wtp1038 is not in mediawiki-installation dsh group daniel_zahn T268524 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:52:52] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp1039 is CRITICAL: Host wtp1039 is not in mediawiki-installation dsh group daniel_zahn T268524 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:54:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wtp[1037-1039].eqiad.wmnet with reason: reimage [20:54:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wtp[1037-1039].eqiad.wmnet with reason: reimage [20:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:22] downtimes expired, i silenced them for 24 hours [21:05:58] RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:09:42] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/29033/" [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [21:15:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:56] in the context of trying to test something else, I made an edit via the api (https://www.mediawiki.org/w/index.php?title=User:DannyS712/sandbox&diff=4529972&oldid=4529955) that also changed the content model of the page, but the "content model change" tag was not added to the edit. I reverted the content model back to wikitext, and made another [21:20:56] edit from the api that again changed the content model, and the second time the tag was properly applied. Any ideas why it wasn't the first time? [21:22:47] oh, found it - ContentHandler does not support multiple automatic tags [21:28:08] PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:12] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [21:42:44] 10SRE, 10Product-Data-Infrastructure, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [21:44:28] !log manually started debmonitor-client.service on ml-serve2004 after 502 Bad gateway error [21:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:10] RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[21:46:30] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [21:48:40] (03PS1) 10CDanis: prepend esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/679494 [21:50:04] (03CR) 10CDanis: [C: 03+2] prepend esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/679494 (owner: 10CDanis) [21:52:32] (03CR) 10Dzahn: [C: 03+1] "talked to Paladox about this. He kindly provided further links below. Upstream changed it from disable to enable and then changed their mi" [puppet] - 10https://gerrit.wikimedia.org/r/679447 (owner: 10Paladox) [21:53:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:56:18] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:04:14] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a r [22:04:14] ved: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [22:05:27] PROBLEM - fastnetmon is alerting #page on netflow3001 is CRITICAL: CRITICAL: fastnetmon is alerting for 91.198.174.192 https://bit.ly/wmf-fastnetmon https://w.wiki/8oU [22:05:31] indeed [22:05:40] here [22:05:42] hi [22:05:45] not, like, surprised, but here [22:06:19] here if ya need dc ops for anything [22:06:32] PROBLEM - PyBal backends health 
check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_80: Servers cp3060.esams.wmnet, cp3064.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:07:47] RECOVERY - fastnetmon is alerting #page on netflow3001 is OK: OK: no fastnetmon alerts https://bit.ly/wmf-fastnetmon https://w.wiki/8oU [22:08:54] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:08:56] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:16:57] (03PS1) 10RLazarus: depool esams [dns] - 10https://gerrit.wikimedia.org/r/679502 [22:16:59] (03CR) 10Legoktm: [C: 03+1] depool esams [dns] - 10https://gerrit.wikimedia.org/r/679502 (owner: 10RLazarus) [22:17:01] (03CR) 10RLazarus: [C: 03+2] depool esams [dns] - 10https://gerrit.wikimedia.org/r/679502 (owner: 10RLazarus) [22:34:44] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:37:27] ^ someone/some people have submitted a lot of stacked patches [22:39:35] legoktm: looks like it was Jdlrobson [22:40:27] ah, I didn't really look, I assumed that the queue would eventually catch up by itself [22:40:30] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 3.601 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:40:39] (03PS1) 10Ahmon Dancy: Fix error message if MWScript.php is run without arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679517 [22:41:21] can the submitted together limit in gerrit be tuned per repo? 
I wonder if things would be any less prone to self-DOS if only like 3 patchset were allowed to stack there. [22:42:28] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 2.396 le 60 Legoktm esams is depooled https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:42:35] (03CR) 10Mstyles: rdf-streaming-updater: create helmfile.d structure (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [22:43:48] it was more than just Jon though. eileen sent in a huge pile of frtech patches too [22:44:20] If we only had infinite vms for jerkins to use up I guess :/ [22:44:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:49:26] (03PS1) 10Mstyles: rdf-streaming-updater: enable HA capability [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) [22:49:36] (03PS9) 10Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [22:49:38] (03PS2) 10Mstyles: rdf-streaming-updater: enable HA capability [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) [22:54:24] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [22:56:58] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:57:08] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [23:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate [[Backport windows|Evening backport window]]
deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:16:49] (03CR) 10Bstorm: "So it's trying to run `qconf -ss` somewhere that isn't expected. That would be querying the submit hosts. The sideeffects are basically th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [23:23:47] sorry I quickly backport https://gerrit.wikimedia.org/r/c/679350/ while we are in the window [23:23:53] (03CR) 10Ladsgroup: [C: 03+2] Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [23:24:39] (03Merged) 10jenkins-bot: Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [23:26:24] Amir1: nice. I can keep an eye on the logs if you want to enjoy your evening :) [23:26:56] thanks. 
It takes some time to propagate through cache [23:27:10] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:679350|Disable legacy javascript global variables in ruwiki (T72470)]] (duration: 01m 16s) [23:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:21] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [23:27:32] IIRC, we had only 700 errors for this in total from all wikis in the last 12 hours [23:28:08] Amir1: yeh I'm just crunching the numbers now [23:28:14] remember that's 1% [23:28:44] According to https://grafana.wikimedia.org/d/000000037/mw-js-deprecate?orgId=1&viewPanel=7&refresh=1m&from=now-12h&to=now&var-Step=5min at peak we were seeing just over 600 events in 5 mins (that's 6000 unsampled) A rate of 2000 a minute will trigger an alert. [23:29:09] we could probably get away with deploying them all tomorrow morning and monitoring it through the day [23:30:42] oh no, lots of those are from code that iterates through the window object [23:31:03] if you look at the smallest variable. There's a baseline for all variables [23:31:06] 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) Ack, not missing messages !- active-active. So long as reconnect to the same hostname is expected to work within a reasonable amount of time, I guess we can close this. Requiring a publ... [23:31:15] ru.wikipedia seems quiet, but that might be a false positive since Russia should be asleep now? 
[23:31:31] (03PS1) 10Ahmon Dancy: enable delay_messageblobstore_purge feature flag in beta scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/679522 (https://phabricator.wikimedia.org/T263872) [23:31:46] 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) 05Open→03Resolved a:03Legoktm [23:32:07] Jdlrobson: it takes at least a couple of hours to propagate through caches, you can cross check the time of my main deployment with the alert to be sure [23:32:15] (we had a couple of days ago) [23:33:22] Amir1: I'm wondering about what to do with all the scripts that don't get fixed [23:33:45] do you think there's a case to make to blank any scripts that don't get fixed within a certain time frame? [23:33:54] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Dzahn) @KFrancis Hi, here is another NDA request (and thanks for T279531#6995697 as well!) -- Daniel [23:33:55] if the scripts are throwing reference errors they are unusable anyway [23:34:22] and we can relatively easily get a list of user script wiki pages which are broken [23:34:59] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Dzahn) @Manuel Please provide [[ https://www.mediawiki.org/wiki/User:KFrancis_(WMF) | Katie ]] with your email address and it will continue from there. [23:35:38] (03PS1) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [23:35:42] hmm, I honestly don't care. 
if they want to have/keep a broken script, it's their choice, their playground [23:36:07] simply turning off the logs for that if we care [23:38:11] my concern here though is these scripts generate a lot of noise, and if a script has a problem with code deprecated several years ago, the script is likely rotten to the core and probably contains other errors that are less easy to filter. It's also a bit of a privacy nightmare as these users are throwing errors on every page they visit. [23:38:58] (03PS2) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [23:39:25] (03CR) 10jerkins-bot: [V: 04-1] logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [23:40:12] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [23:40:21] (03PS3) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [23:43:05] (03PS4) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [23:46:49] (03PS5) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [23:48:30] (03PS1) 10Cwhite: logstash: clean up apifeatureusage curator job [puppet] - 10https://gerrit.wikimedia.org/r/679525 (https://phabricator.wikimedia.org/T274394) [23:49:34] (03PS6) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) [23:50:14] yeah, my idea: just disable error logging on them [23:53:14] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1002/29052/" [puppet] - 10https://gerrit.wikimedia.org/r/679525 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [23:57:39] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29053/" [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [23:58:46] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10KFrancis) @Dzahn As soon as I have the email address, I'll forward for processing. Thanks! [23:59:35] (03PS1) 10RLazarus: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/679526