[00:00:11] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:21] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:21] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:19] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:37] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:45] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[00:27:23] Operations, Pywikibot, cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (bd808) >>! In T257536#6316428, @Dzahn wrote: > So should this also be pointed at the wikmediacloud.org servers ? This redirect is currently handl...
[01:32:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:39:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:48:53] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[01:49:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:52:37] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[02:49:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:51:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:33:59] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[03:34:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:35:45] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:36:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:45:49] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[03:47:43] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[03:57:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:06:29] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:09:38] (PS1) Tim Starling: Disable LilyPond execution with a better error message [mediawiki-config] - https://gerrit.wikimedia.org/r/614598
[04:46:36] (CR) Legoktm: [C: +1] Disable LilyPond execution with a better error message [mediawiki-config] - https://gerrit.wikimedia.org/r/614598 (owner: Tim Starling)
[04:48:29] (PS1) Tim Starling: [1.35.0-wmf.41] Nicer handling for disabling of shell execution [extensions/Score] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614040 (https://phabricator.wikimedia.org/T257062)
[04:49:01] (CR) Tim Starling: [C: +2] [1.35.0-wmf.41] Nicer handling for disabling of shell execution [extensions/Score] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614040 (https://phabricator.wikimedia.org/T257062) (owner: Tim Starling)
[05:02:37] PROBLEM - Host analytics1050 is DOWN: PING CRITICAL - Packet loss = 100%
[05:05:33] (Merged) jenkins-bot: [1.35.0-wmf.41] Nicer handling for disabling of shell execution [extensions/Score] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614040 (https://phabricator.wikimedia.org/T257062) (owner: Tim Starling)
[05:12:05] elukey: looks like analytics1050 is down indeed ^
[05:12:47] (PS1) Marostegui: Revert "db1082: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/614041
[05:13:38] (CR) Marostegui: [C: +2] Revert "db1082: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/614041 (owner: Marostegui)
[05:15:19] (CR) Marostegui: [C: +2] mariadb: Changing link to section from "slave" to "replica" [puppet] - https://gerrit.wikimedia.org/r/613732 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[05:18:09] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/Score: better error message for disabling of Score (duration: 01m 10s)
[05:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11953 and previous config saved to /var/cache/conftool/dbconfig/20200720-051846-marostegui.json
[05:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:52] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[05:19:01] (CR) Tim Starling: [C: +2] Disable LilyPond execution with a better error message [mediawiki-config] - https://gerrit.wikimedia.org/r/614598 (owner: Tim Starling)
[05:19:02] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Marostegui) I have started to repool db1082
[05:19:54] (Merged) jenkins-bot: Disable LilyPond execution with a better error message [mediawiki-config] - https://gerrit.wikimedia.org/r/614598 (owner: Tim Starling)
[05:23:21] Operations, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) Thank you both!
[05:24:49] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: disable lilypond with better error message (duration: 00m 57s)
[05:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:58] !log Deploy MCR schema change on enwiki on db1119 - T238966
[05:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:03] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[05:30:43] !log tstarling@deploy1001 scap sync-l10n completed (1.35.0-wmf.41) (duration: 02m 44s)
[05:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:53] !log tstarling@deploy1001 Started scap: fixing missing message from previous sync-dir
[05:38:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11955 and previous config saved to /var/cache/conftool/dbconfig/20200720-053816-marostegui.json
[05:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:37] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[05:47:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11956 and previous config saved to /var/cache/conftool/dbconfig/20200720-054747-marostegui.json
[05:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:52] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[05:56:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11957 and previous config saved to /var/cache/conftool/dbconfig/20200720-055614-marostegui.json
[05:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:20] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[05:57:36] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:57:38] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Marostegui)
[05:58:18] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Marostegui) Open→Resolved a:Marostegui db1082 is fully repooled. I am going to consider this resolved. There is not much else we can do really - this host will be replaced and refreshed in Q2 (T258336), it is quite o...
[05:58:20] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:07:50] !log tstarling@deploy1001 Finished scap: fixing missing message from previous sync-dir (duration: 29m 57s)
[06:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:35] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/Score/includes/Score.php: reverting Reedy's temporary patch for hardcoding the lilypond version (duration: 00m 57s)
[06:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:13] marostegui: good morning! checking thanks :)
[06:16:23] elukey: :**
[06:19:11] lovely, connected to mgmt and serial, then all blocked
[06:19:24] PROBLEM - Host analytics1050.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:19:32] exactly
[06:22:56] so I guess this needs to be handled by dcops sigh
[06:24:33] RECOVERY - Host analytics1050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[06:26:49] Operations, ops-eqiad, Analytics-Clusters: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (elukey)
[06:26:58] ahahahahah
[06:27:35] XDDDDDDD
[06:27:39] it just needed a rest
[06:27:51] Operations, CX-cxserver, Citoid, Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (Physikerwelt) mobileapps continues to use URI as of https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/514414/1/test/features/app/spe...
[06:27:52] PROBLEM - ores on ores2007 is CRITICAL: connect to address 10.192.48.88 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:29:16] RECOVERY - ores on ores2007 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 5.515 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:30:04] marostegui: so if I type `console com2` it gets stuck again
[06:30:06] weird
[06:30:12] (CR) Giuseppe Lavagetto: profile::kubernetes::deployment_server: add kube-env script (2 comments) [puppet] - https://gerrit.wikimedia.org/r/613190 (owner: Giuseppe Lavagetto)
[06:30:26] elukey: have you tried to reset the idrac?
[06:30:32] Operations, ops-eqiad, Analytics-Clusters: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (elukey) Right after creating this task (of course) the mgmt idrac returned available. `getsel` show as last event something from days ago: ` Record: 35 Date/Time: 03/05/2020...
[06:30:46] marostegui: nope, the soft one right?
[06:30:57] (I always fear to reset passwords etc..)
[06:31:21] yeah, a soft one
[06:31:25] nothing to lose
[06:31:34] PROBLEM - Host analytics1050.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:33:26] I'll try when it recovers --^
[06:33:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:35:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:37:06] RECOVERY - Host analytics1050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[06:43:18] marostegui: tried, same result :(
[06:43:57] Operations, ops-eqiad, Analytics-Clusters: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (elukey) Also tried with a `racadm racreset soft`, but the issue is the same.
[06:47:33] elukey: :(
[06:54:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114', diff saved to https://phabricator.wikimedia.org/P11958 and previous config saved to /var/cache/conftool/dbconfig/20200720-065438-marostegui.json
[06:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:02] !log restart matomo1002's mariadb to pick up new TLS settings
[06:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:37] (PS1) Marostegui: wmnet: Failover m1 dbproxy; dbproxy1012 to dbproxy1014 [dns] - https://gerrit.wikimedia.org/r/614606 (https://phabricator.wikimedia.org/T255408)
[07:00:48] Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (Joe) a:Joe Hi @CGlenn can you please indicate a date at which you will stop needing access? Once I have this info, I can proceed with creating your access.
[07:03:59] Operations, Analytics, Patch-For-Review: Move yarn.wikimedia.org to a separate Buster VM - https://phabricator.wikimedia.org/T258152 (Joe) a:MoritzMuehlenhoff Assigning to Moritz given it seems he's working on this.
[07:04:44] Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (Joe) p:Triage→Medium
[07:05:48] Operations, Parsoid, serviceops, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Joe) a:Dzahn Assigning to Daniel as he's actively working on this.
[07:08:20] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (Joe) a:JMeybohm
[07:08:32] (PS1) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450)
[07:17:14] (Abandoned) Giuseppe Lavagetto: termbox: actually use chart version 0.0.13 [deployment-charts] - https://gerrit.wikimedia.org/r/612524 (owner: Giuseppe Lavagetto)
[07:19:05] !log Drop non used reviewdb database - T255715
[07:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:14] T255715: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715
[07:22:45] (PS2) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450)
[07:28:15] (PS3) Kormat: wmnet: Update es4-master alias [dns] - https://gerrit.wikimedia.org/r/612560 (https://phabricator.wikimedia.org/T257847)
[07:28:22] (PS3) Giuseppe Lavagetto: profile::kubernetes::deployment_server: add kube-env script [puppet] - https://gerrit.wikimedia.org/r/613190
[07:28:24] (PS3) Giuseppe Lavagetto: kubernetes::deployment_server::helmfile: use kube_env in .hfenv [puppet] - https://gerrit.wikimedia.org/r/613191
[07:29:22] (CR) Kormat: [C: +1] wmnet: Failover m1 dbproxy; dbproxy1012 to dbproxy1014 [dns] - https://gerrit.wikimedia.org/r/614606 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[07:29:35] (CR) Marostegui: [C: +2] wmnet: Failover m1 dbproxy; dbproxy1012 to dbproxy1014 [dns] - https://gerrit.wikimedia.org/r/614606 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[07:29:54] !log Move m1-master from dbproxy1012 to dbproxy1014 - T255408
[07:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:59] T255408: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408
[07:32:07] (PS4) Giuseppe Lavagetto: profile::kubernetes::deployment_server: add kube-env script [puppet] - https://gerrit.wikimedia.org/r/613190
[07:32:09] (PS4) Giuseppe Lavagetto: kubernetes::deployment_server::helmfile: use kube_env in .hfenv [puppet] - https://gerrit.wikimedia.org/r/613191
[07:32:12] (PS6) Muehlenhoff: Remove IDP defintions for logstash vhosts [puppet] - https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998)
[07:39:06] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/611373 (owner: DCausse)
[07:46:21] (CR) Muehlenhoff: [C: +2] Switch over Yarn to an-tool1008 in ATS [puppet] - https://gerrit.wikimedia.org/r/613606 (https://phabricator.wikimedia.org/T258152) (owner: Muehlenhoff)
[07:51:00] !log updating envoyproxy to 1.14.4-1 on all non mw and restbase hosts
[07:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:15] (PS3) Kormat: mariadb: Promote es1021 to es4 master. [puppet] - https://gerrit.wikimedia.org/r/612551 (https://phabricator.wikimedia.org/T257847)
[07:54:46] !log installing libopenmpt security updates
[07:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:59] (PS6) Ema: LVS: update Varnish healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015)
[08:00:01] (PS6) Ema: icinga: update Varnish healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015)
[08:00:03] (PS6) Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015)
[08:00:05] (PS6) Ema: ATS: stop responding to varnishcheck/status [puppet] - https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015)
[08:01:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:04:58] (CR) Vgutierrez: [C: +1] LVS: update Varnish healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[08:05:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:12:11] (PS3) Ammarpad: Set $wgUrlShortenerAllowedDomains for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134)
[08:16:07] (CR) Ammarpad: "I have changed it to duplication out of caution, instead of direct switching. I am not sure whether direct replacement will have any side " [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad)
[08:18:26] (CR) Vgutierrez: [C: +1] icinga: update Varnish healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[08:23:20] (CR) Urbanecm: [C: -1] "Is it really necessary to have the whitelist in place three times?" [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad)
[08:29:29] (PS7) Ema: icinga: update Varnish healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015)
[08:29:32] (PS7) Ema: LVS: update Varnish healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015)
[08:29:33] (PS7) Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015)
[08:29:37] (PS7) Ema: ATS: stop responding to varnishcheck/status [puppet] - https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015)
[08:29:39] (CR) Ammarpad: "> Is it really necessary to have the whitelist in place three times?" [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad)
[08:35:07] (CR) Ema: [C: +2] icinga: update Varnish healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[08:37:28] Operations, ops-eqiad: Interface errors on asw2-d-eqiad:xe-7/0/0 (ms-be1037) - https://phabricator.wikimedia.org/T257541 (fgiunchedi) This port recently got a SFP swap in {T257675} by @Jclark-ctr, perhaps the error spike was that ? I don't see continued errors now
[08:37:42] Operations, DBA: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (Joe)
[08:38:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:38:50] Operations, DBA: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (Marostegui) p:Triage→Medium @jbond Is db-section-idp-test really needed?
[08:38:58] Operations: Broken cumin aliases - https://phabricator.wikimedia.org/T258377 (Joe)
[08:39:08] Operations: Broken cumin aliases - https://phabricator.wikimedia.org/T258377 (Joe) p:Triage→Medium
[08:40:20] Operations, DBA: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (Kormat) The aliases are auto-generated from the list of valid sections defined in `modules/profile/types/mariadb/valid_section.pp`. We could maybe special-case ones which are always going to be em...
[08:40:42] Operations, DBA, User-Kormat: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (Kormat) a:Kormat
[08:48:13] (PS1) Muehlenhoff: Switch yarn.wikimedia.org to CAS [puppet] - https://gerrit.wikimedia.org/r/614692 (https://phabricator.wikimedia.org/T159584)
[08:51:06] (PS1) Giuseppe Lavagetto: cumin: fix the alias hadoop-all [puppet] - https://gerrit.wikimedia.org/r/614694 (https://phabricator.wikimedia.org/T258377)
[08:53:30] Operations, ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (fgiunchedi) This appears to be a BBU fault: ` Cache Status: Permanently Disabled Cache Status Details: Cache disabled; battery/capacitor is not attached ` Do we have a BBU on site and/or can order a bun...
[08:55:19] (CR) Elukey: [C: +1] cumin: fix the alias hadoop-all [puppet] - https://gerrit.wikimedia.org/r/614694 (https://phabricator.wikimedia.org/T258377) (owner: Giuseppe Lavagetto)
[08:56:32] (Abandoned) Filippo Giunchedi: role: install fcgid package on netmon [puppet] - https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:58:28] Operations, DBA, User-Kormat: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (Kormat) >>! In T258376#6318266, @Marostegui wrote: > @jbond Is db-section-idp-test really needed? It's there because of `profile::mariadb::misc::idp_test`, but it looks like cur...
[08:59:26] (CR) Muehlenhoff: [C: +1] cumin: fix the alias hadoop-all [puppet] - https://gerrit.wikimedia.org/r/614694 (https://phabricator.wikimedia.org/T258377) (owner: Giuseppe Lavagetto)
[09:00:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:00:58] Operations, DBA, User-Kormat: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (jcrespo) Maybe 608639 wasn't properly reverted?
[09:02:42] (PS1) Kormat: mariadb: Remove unused idp_test profile. [puppet] - https://gerrit.wikimedia.org/r/614696 (https://phabricator.wikimedia.org/T258376)
[09:02:55] (CR) Hnowlan: [C: +1] role::parsoid: Add missing exporters for parsoid [puppet] - https://gerrit.wikimedia.org/r/613307 (owner: Effie Mouzeli)
[09:04:15] !log updating envoyproxy to 1.14.4-1 on all codfw hosts
[09:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:54] (CR) DCausse: [C: +1] Add logout location (1 comment) [puppet] - https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: ZPapierski)
[09:09:09] (CR) Giuseppe Lavagetto: [C: +2] cumin: fix the alias hadoop-all [puppet] - https://gerrit.wikimedia.org/r/614694 (https://phabricator.wikimedia.org/T258377) (owner: Giuseppe Lavagetto)
[09:10:08] (PS1) Kormat: mariadb: Comment out sections that do not appear in puppet. [puppet] - https://gerrit.wikimedia.org/r/614697 (https://phabricator.wikimedia.org/T258376)
[09:11:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P11959 and previous config saved to /var/cache/conftool/dbconfig/20200720-091119-marostegui.json
[09:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:21] (PS2) Muehlenhoff: Switch yarn.wikimedia.org to CAS [puppet] - https://gerrit.wikimedia.org/r/614692 (https://phabricator.wikimedia.org/T159584)
[09:14:25] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/614698 (https://phabricator.wikimedia.org/T128546)
[09:16:06] (PS1) Jbond: mariadp::misc::idp_test: remove unused profile [puppet] - https://gerrit.wikimedia.org/r/614699
[09:16:56] (PS2) Jbond: mariadp::misc::idp_test: remove unused profile [puppet] - https://gerrit.wikimedia.org/r/614699 (https://phabricator.wikimedia.org/T258376)
[09:17:53] !log updating envoyproxy to 1.14.4-1 on all eqiad hosts
[09:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:02] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/614699 (https://phabricator.wikimedia.org/T258376) (owner: Jbond)
[09:18:58] (CR) Kormat: [C: +1] "Thanks :)" [puppet] - https://gerrit.wikimedia.org/r/614699 (https://phabricator.wikimedia.org/T258376) (owner: Jbond)
[09:19:11] (Abandoned) Kormat: mariadb: Remove unused idp_test profile. [puppet] - https://gerrit.wikimedia.org/r/614696 (https://phabricator.wikimedia.org/T258376) (owner: Kormat)
[09:20:02] (CR) Jbond: [C: +2] mariadp::misc::idp_test: remove unused profile [puppet] - https://gerrit.wikimedia.org/r/614699 (https://phabricator.wikimedia.org/T258376) (owner: Jbond)
[09:23:12] (PS2) Kormat: mariadb: Comment out sections that do not appear in puppet. [puppet] - https://gerrit.wikimedia.org/r/614697 (https://phabricator.wikimedia.org/T258376)
[09:24:12] (PS4) Kormat: wmnet: Update es4-master alias [dns] - https://gerrit.wikimedia.org/r/612560 (https://phabricator.wikimedia.org/T257847)
[09:25:32] !log update compiler facts
[09:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:40] Operations, DBA, Patch-For-Review, User-Kormat: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (jbond) Open→Resolved This has been removed now
[09:27:23] godog: i feel like you need a "ze frank" narrated video for that :) (https://www.youtube.com/watch?v=GDwOi7HpHtQ)
[09:28:01] Operations, DBA, Patch-For-Review, User-Kormat: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (Kormat) Resolved→Open Re-opening until the s10/s11 issue has been resolved.
[09:28:21] "true facts about pcc" would be a hit [09:28:38] kormat: hahaha that's brilliant [09:29:02] looking forward to "TTS SAL" app [09:30:55] (03PS1) 10Elukey: Update analytics-in(4|6) filters [homer/public] - 10https://gerrit.wikimedia.org/r/614702 [09:31:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P11960 and previous config saved to /var/cache/conftool/dbconfig/20200720-093154-marostegui.json [09:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [09:41:00] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/23982/" [puppet] - 10https://gerrit.wikimedia.org/r/614692 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff) [09:43:06] (03PS12) 10Vgutierrez: ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [09:43:22] 10Operations, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10fgiunchedi) With Thanos in production now we'll have to add steps to cater for the switchover from one host to the other. Since the underlying data will be the same... [09:45:54] (03PS4) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [09:46:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P11961 and previous config saved to /var/cache/conftool/dbconfig/20200720-094609-marostegui.json [09:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[debs/grafana-loki] (debian/sid) - https://gerrit.wikimedia.org/r/610864 (owner: Cwhite) [09:49:40] (CR) Filippo Giunchedi: [C: +1] role::grafana: allow embedding [puppet] - https://gerrit.wikimedia.org/r/611250 (https://phabricator.wikimedia.org/T250792) (owner: Jbond) [09:50:15] (CR) Jbond: [C: +2] role::grafana: allow embedding [puppet] - https://gerrit.wikimedia.org/r/611250 (https://phabricator.wikimedia.org/T250792) (owner: Jbond) [09:50:31] (CR) Filippo Giunchedi: [C: +1] profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263 (owner: Jbond) [09:53:20] (CR) Thiemo Kreuz (WMDE): [C: +1] "This should still be merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/609748 (https://phabricator.wikimedia.org/T256623) (owner: Awight) [09:53:41] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-08-19 09:53:05 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [09:53:55] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-08-19 09:53:05 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/ [09:55:30] (CR) Filippo Giunchedi: "I tried running PCC on production hosts for this change to confirm, although there are merge conflicts now:" [puppet] - https://gerrit.wikimedia.org/r/610045 (owner: Jbond) [09:58:03] (PS6) Jbond: statsite::instance: fix style violation in define [puppet] - https://gerrit.wikimedia.org/r/610045 [10:00:29] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:02:15] RECOVERY - Prometheus jobs reduced availability on icinga1001 is 
OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:04:00] ACKNOWLEDGEMENT - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-08-19 09:53:05 +0000 (expires in 29 days) Valentin Gutierrez the renewed cert will be pushed to the servers on 2020-07-27 09:02:37 - The acknowledgement expires at: 2020-07-27 11:02:41. https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [10:04:00] ACKNOWLEDGEMENT - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-08-19 09:53:05 +0000 (expires in 29 days) Valentin Gutierrez the renewed cert will be pushed to the servers on 2020-07-27 09:02:37 - The acknowledgement expires at: 2020-07-27 11:02:41. https://phabricator.wikimedia.org/tag/phabricator/ [10:15:02] (CR) Filippo Giunchedi: "LGTM overall, although for graphite hosts in production this will change statsite to point from graphite-in.eqiad.wmnet to localhost, whic" [puppet] - https://gerrit.wikimedia.org/r/610045 (owner: Jbond) [10:20:43] (PS3) Muehlenhoff: Switch yarn.wikimedia.org to CAS [puppet] - https://gerrit.wikimedia.org/r/614692 (https://phabricator.wikimedia.org/T159584) [10:25:38] (CR) Elukey: [C: +1] Switch yarn.wikimedia.org to CAS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/614692 (https://phabricator.wikimedia.org/T159584) (owner: Muehlenhoff) [10:27:18] Operations, ops-codfw, RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (akosiaris) >>! In T256863#6313866, @wiki_willy wrote: > but then someone decided to push out the refresh until FY21-22. That was my recommendation by the way. The reasoning behind it was that restbase201... [10:30:04] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. 
Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T1030). [10:30:28] (CR) Muehlenhoff: "Updated PCC: https://puppet-compiler.wmflabs.org/compiler1002/23986/" [puppet] - https://gerrit.wikimedia.org/r/614692 (https://phabricator.wikimedia.org/T159584) (owner: Muehlenhoff) [10:30:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1114', diff saved to https://phabricator.wikimedia.org/P11962 and previous config saved to /var/cache/conftool/dbconfig/20200720-103058-marostegui.json [10:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:25] (PS7) Jbond: statsite::instance: fix style violation in define [puppet] - https://gerrit.wikimedia.org/r/610045 [10:32:08] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/614698 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:32:51] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/614698 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:34:40] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:614698| Bumping portals to master (614698)]] (duration: 00m 59s) [10:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:26] (PS17) Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - https://gerrit.wikimedia.org/r/609403 [10:35:37] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:614698| Bumping portals to master (614698)]] (duration: 00m 56s) [10:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:04] (PS8) Jbond: statsite::instance: fix style violation in define [puppet] - https://gerrit.wikimedia.org/r/610045 [10:38:21] (CR) Alexandros Kosiaris: [C: +1] 
profile::kubernetes::deployment_server: add kube-env script [puppet] - https://gerrit.wikimedia.org/r/613190 (owner: Giuseppe Lavagetto) [10:38:53] (CR) Alexandros Kosiaris: [C: +1] kubernetes::deployment_server::helmfile: use kube_env in .hfenv [puppet] - https://gerrit.wikimedia.org/r/613191 (owner: Giuseppe Lavagetto) [10:39:49] (CR) Giuseppe Lavagetto: [C: +2] profile::kubernetes::deployment_server: add kube-env script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/613190 (owner: Giuseppe Lavagetto) [10:40:27] (CR) Jbond: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/610045 (owner: Jbond) [10:41:33] (CR) Jbond: [C: +2] profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263 (owner: Jbond) [10:42:58] (PS18) Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - https://gerrit.wikimedia.org/r/609403 [10:48:14] (CR) Alexandros Kosiaris: [C: +2] Add mw_resource_loader_uri to Node.js service config vars [puppet] - https://gerrit.wikimedia.org/r/613645 (https://phabricator.wikimedia.org/T258186) (owner: Mholloway) [10:48:16] !log rebooting releases* hosts for kernel security update [10:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:54] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui) [10:50:11] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui) p:Triage→Medium [10:50:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:09] !log jmm@cumin2001 START - Cookbook 
sre.hosts.reboot-single [10:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:46] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui) [10:51:48] Operations, DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (Marostegui) [10:51:51] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Marostegui) [10:52:37] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui) [10:52:40] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) [10:54:29] (PS1) Giuseppe Lavagetto: admin: refresh my bashrc a bit [puppet] - https://gerrit.wikimedia.org/r/614726 [10:55:05] (CR) Marostegui: Update analytics-in(4|6) filters (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/614702 (owner: Elukey) [10:55:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:33] (CR) Giuseppe Lavagetto: [C: +2] admin: refresh my bashrc a bit [puppet] - https://gerrit.wikimedia.org/r/614726 (owner: Giuseppe Lavagetto) [10:57:26] (PS1) Giuseppe Lavagetto: admin: fix PS1 string for deployment hosts [puppet] - https://gerrit.wikimedia.org/r/614729 [10:58:04] jouncebot: next [10:58:04] In 0 hour(s) and 1 minute(s): European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T1100) [10:58:23] (CR) Giuseppe Lavagetto: [C: +2] admin: fix PS1 string for deployment hosts [puppet] - https://gerrit.wikimedia.org/r/614729 (owner: Giuseppe Lavagetto) [10:59:19] (PS2) Giuseppe Lavagetto: admin: fix PS1 string for deployment hosts [puppet] - https://gerrit.wikimedia.org/r/614729 
[10:59:34] (CR) jerkins-bot: [V: -1] admin: fix PS1 string for deployment hosts [puppet] - https://gerrit.wikimedia.org/r/614729 (owner: Giuseppe Lavagetto) [11:00:00] (PS3) Giuseppe Lavagetto: admin: fix PS1 string for deployment hosts [puppet] - https://gerrit.wikimedia.org/r/614729 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T1100). [11:00:04] Urbanecm and Ammarpad: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] I'll deploy today :) [11:00:20] (CR) Giuseppe Lavagetto: [C: +2] admin: fix PS1 string for deployment hosts [puppet] - https://gerrit.wikimedia.org/r/614729 (owner: Giuseppe Lavagetto) [11:00:25] (PS5) Urbanecm: Create closer group at itwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) [11:00:31] (CR) Urbanecm: [C: +2] Create closer group at itwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) (owner: Urbanecm) [11:00:51] (CR) Giuseppe Lavagetto: [V: +2 C: +2] admin: fix PS1 string for deployment hosts [puppet] - https://gerrit.wikimedia.org/r/614729 (owner: Giuseppe Lavagetto) [11:02:00] (Merged) jenkins-bot: Create closer group at itwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) (owner: Urbanecm) [11:10:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 1c7a6215d06aff6cb0a75701292d8147f006d9e4: Create closer group at itwikinews (T257927) (duration: 00m 57s) [11:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:10:10] T257927: Create new "archiviatore" flag on it.wikinews - https://phabricator.wikimedia.org/T257927 [11:11:19] (PS2) Urbanecm: Convert ukwikisource ns:250 and ns:251 to have subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/614579 (https://phabricator.wikimedia.org/T255930) [11:11:24] (CR) Urbanecm: [C: +2] Convert ukwikisource ns:250 and ns:251 to have subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/614579 (https://phabricator.wikimedia.org/T255930) (owner: Urbanecm) [11:12:14] (Merged) jenkins-bot: Convert ukwikisource ns:250 and ns:251 to have subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/614579 (https://phabricator.wikimedia.org/T255930) (owner: Urbanecm) [11:12:35] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01006 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:13:44] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6830723b0ad5031e67062ba838f09cd07c2b97a1: Convert ukwikisource ns:250 and ns:251 to have subpages (T255930) (duration: 00m 57s) [11:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:50] T255930: Convert ukwikisource ns:250 and ns:251 to have subpages - https://phabricator.wikimedia.org/T255930 [11:13:50] (PS4) Urbanecm: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: Gerrit Patch Uploader) [11:14:27] (PS5) Urbanecm: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: Gerrit Patch Uploader) [11:14:33] (CR) Urbanecm: [C: +2] Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/614591 
(https://phabricator.wikimedia.org/T258346) (owner: Gerrit Patch Uploader) [11:15:28] (Merged) jenkins-bot: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: Gerrit Patch Uploader) [11:17:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0c784784d75c2bbfb570495a6a097d4c44cbe6b3: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wiktionary (T258346) (duration: 00m 58s) [11:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:44] T258346: Set $wgCategoryCollation to 'uca-bs-u-kn' on Bosnian Wiktionary and rebuild category sort keys - https://phabricator.wikimedia.org/T258346 [11:18:47] !log Run mwscript updateCollation.php --wiki=bswiktionary --previous-collation=uppercase in a tmux session at mwmaint1002 (T258346) [11:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:36] (PS2) Urbanecm: Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/614584 (https://phabricator.wikimedia.org/T253800) [11:19:41] Ammarpad: are you here? 
[11:19:45] (CR) Urbanecm: [C: +2] Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/614584 (https://phabricator.wikimedia.org/T253800) (owner: Urbanecm) [11:20:39] (Merged) jenkins-bot: Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/614584 (https://phabricator.wikimedia.org/T253800) (owner: Urbanecm) [11:20:59] (CR) Urbanecm: "marked as ready per RhinosF1's request at IRC" [mediawiki-config] - https://gerrit.wikimedia.org/r/604515 (https://phabricator.wikimedia.org/T255031) (owner: RhinosF1) [11:21:04] (PS5) Urbanecm: Add NamespaceAliases for kowikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/604515 (https://phabricator.wikimedia.org/T255031) (owner: RhinosF1) [11:21:44] (CR) Urbanecm: [C: +2] Add NamespaceAliases for kowikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/604515 (https://phabricator.wikimedia.org/T255031) (owner: RhinosF1) [11:22:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bc5671a90c65b66989e470fc41225986b2ec9fb5: Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T253800) (duration: 00m 57s) [11:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:31] T253800: Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T253800 [11:22:32] (Merged) jenkins-bot: Add NamespaceAliases for kowikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/604515 (https://phabricator.wikimedia.org/T255031) (owner: RhinosF1) [11:23:29] @Urbanecm Yes [11:24:28] I’d like to do a quick createAndPromote.php (for myself on testwikidatawiki) after the config changes are done [11:24:36] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 
3719668511231589b4fc6a723ccdfa772068ad5f: Add NamespaceAliases for kowikiquote (T255031) (duration: 00m 57s) [11:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:42] T255031: Add namespace aliases on Korean Wikiquote - https://phabricator.wikimedia.org/T255031 [11:24:53] (CR) Urbanecm: [C: -1] "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad) [11:24:59] Ammarpad: gotcha. See my reply on your second patch [11:25:41] !log mwscript namespaceDupes.php --wiki=kowikiquote --fix (T255031) [11:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:50] Ready to go https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/613684 Ammarpad [11:27:04] (PS2) Urbanecm: Remove wgPopupsPageBlacklist config setting [mediawiki-config] - https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) (owner: Ammarpad) [11:27:14] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) (owner: Ammarpad) [11:28:03] (Merged) jenkins-bot: Remove wgPopupsPageBlacklist config setting [mediawiki-config] - https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) (owner: Ammarpad) [11:28:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:28:47] Ammarpad: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/613684 is at mwdebug1001, could you test it please? 
[11:29:28] Ty Urbanecm [11:29:34] np RhinosF1 :) [11:29:53] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) (owner: Ema) [11:32:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:32:17] Ammarpad: ping? [11:32:49] Lucas_WMDE: who would you like to be? :-) [11:32:57] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005031 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:33:23] I need interface-admin to fix a gadget definition ^^ [11:33:32] (I can do it myself once the rest of the window is done, just wanted to let you know) [11:33:34] @Urbanecm It looks normal (popups config) [11:33:58] (CR) Ema: [C: +1] ATS: Support listening on multiple ports [puppet] - https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) (owner: Vgutierrez) [11:34:17] Lucas_WMDE: okay, feel free to do the script run now :) [11:34:32] ok! 
[11:34:39] Ammarpad: okay, syncing [11:34:57] !log urbanecm@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 01s) [11:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:35] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript createAndPromote.php testwikidatawiki --custom-groups=interface-admin --force 'Lucas Werkmeister (WMDE)' [11:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:44] (PS4) Ammarpad: Set $wgUrlShortenerAllowedDomains for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) [11:36:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c12f1dee6b9888849c64312c2a4fd65ecbd4091e: Remove wgPopupsPageBlacklist config setting (T254676) (duration: 00m 57s) [11:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:23] T254676: Rename wgPopupsPageBlacklist - https://phabricator.wikimedia.org/T254676 [11:36:40] Ammarpad: first one done [11:36:41] (CR) Ammarpad: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad) [11:37:11] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad) [11:37:32] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Marostegui) Just for the record, looks like this host has had a history of HW crashes before: T178460 T158188 T145533 T145607 [11:37:34] (CR) Urbanecm: [C: +2] Update brwikimedia logo and add upscaled versions [mediawiki-config] - https://gerrit.wikimedia.org/r/612629 (https://phabricator.wikimedia.org/T257925) (owner: Chico Venancio) [11:37:36] (PS5) Ammarpad: Set $wgUrlShortenerAllowedDomains for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 
(https://phabricator.wikimedia.org/T258134) [11:37:50] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui) [11:37:56] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad) [11:38:17] (Merged) jenkins-bot: Update brwikimedia logo and add upscaled versions [mediawiki-config] - https://gerrit.wikimedia.org/r/612629 (https://phabricator.wikimedia.org/T257925) (owner: Chico Venancio) [11:38:42] (Merged) jenkins-bot: Set $wgUrlShortenerAllowedDomains for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) (owner: Ammarpad) [11:39:00] @Urbanecm responded to your comment (urlshortener config) [11:39:12] Ammarpad: saw that, merged, and pulled onto mwdebug1001 :) [11:39:20] (PS1) DannyS712: CommonSettings: slave -> replica [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) [11:39:47] (PS2) DannyS712: CommonSettings: slave -> replica [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) [11:39:59] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [11:40:19] can I add https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/614043 to the current backport window? It just changes a comment [11:42:25] @Urbanecm working fine on meta. 
[11:42:31] thanks, syncing [11:42:51] !log urbanecm@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 00s) [11:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:00] (CR) Majavah: [C: -1] CommonSettings: slave -> replica (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [11:43:30] (PS3) DannyS712: CommonSettings: slave -> replica [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) [11:43:55] (CR) Majavah: [C: +1] CommonSettings: slave -> replica [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [11:43:57] (CR) DannyS712: CommonSettings: slave -> replica (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [11:44:01] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 5b97a06fa2e9a06c251a9c1fd2ddd9beec01a683: Set $wgUrlShortenerAllowedDomains for all wikis (T258134) (duration: 00m 57s) [11:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:18] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: f7560b6061dd3a60ccf56c916ebf70a3f104bea7: Update brwikimedia logo and add upscaled versions (T257925) (duration: 00m 56s) [11:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:23] T257925: Update brwikimedia logo - https://phabricator.wikimedia.org/T257925 [11:47:18] (PS2) Urbanecm: Update brwikimedia logo and add upscaled versions (config) [mediawiki-config] - https://gerrit.wikimedia.org/r/612636 (https://phabricator.wikimedia.org/T257925) (owner: Chico Venancio) [11:47:22] (CR) Urbanecm: [C: +2] Update brwikimedia logo and add upscaled versions (config) [mediawiki-config] 
- https://gerrit.wikimedia.org/r/612636 (https://phabricator.wikimedia.org/T257925) (owner: Chico Venancio) [11:48:07] (Merged) jenkins-bot: Update brwikimedia logo and add upscaled versions (config) [mediawiki-config] - https://gerrit.wikimedia.org/r/612636 (https://phabricator.wikimedia.org/T257925) (owner: Chico Venancio) [11:49:03] !log Purge 'https://en.wikipedia.org/static/images/project-logos/bnwikimedia.png' [11:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:16] !log urbanecm@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 00s) [11:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 946bf3d239f278b4e099f5dec676f5e2be61d8ca: Update brwikimedia logo and add upscaled versions (config) (T257925) (duration: 00m 57s) [11:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:16] (CR) Urbanecm: [C: +2] "noop" [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [11:58:21] DannyS712: +2'ed [11:58:27] thanks [11:58:47] np [11:58:52] (CR) jerkins-bot: [V: -1] CommonSettings: slave -> replica [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [11:58:57] what? [11:59:24] (CR) Urbanecm: [C: +2] CommonSettings: slave -> replica [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [12:00:04] Urbanecm and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Creating new wikis deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T1200). 
[12:00:08] (Merged) jenkins-bot: CommonSettings: slave -> replica [mediawiki-config] - https://gerrit.wikimedia.org/r/614043 (https://phabricator.wikimedia.org/T254646) (owner: DannyS712) [12:01:37] \o/ [12:01:45] Amir1: waiting for you [12:02:03] (CR) Filippo Giunchedi: [C: +1] "Thanks for the update, LGTM" [puppet] - https://gerrit.wikimedia.org/r/610045 (owner: Jbond) [12:02:10] !log installing qemu security updates on buster [12:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:15] (PS13) Vgutierrez: ATS: Support listening on multiple ports [puppet] - https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [12:09:43] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) (owner: Vgutierrez) [12:10:32] (CR) jerkins-bot: [V: -1] ATS: Support listening on multiple ports [puppet] - https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) (owner: Vgutierrez) [12:10:40] wut? [12:11:27] Urbanecm: oh sorry [12:11:34] hi! [12:11:41] I got the time wrong, I thought it's in an hour [12:11:47] it's all good [12:12:00] do you want to start? 
[12:12:03] yup [12:12:07] starting with T257674 [12:12:08] T257674: Create Moroccan Arabic Wikipedia - https://phabricator.wikimedia.org/T257674 [12:12:25] (PS4) Urbanecm: Initial configuration for arywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) [12:12:41] (CR) Urbanecm: [C: +2] Initial configuration for arywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) (owner: Urbanecm) [12:13:42] (Merged) jenkins-bot: Initial configuration for arywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) (owner: Urbanecm) [12:13:44] (CR) Jbond: "Alex made the observation that orts users are able to create email addresses under the wikimedia.org domain. As such if otrs was processe" [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [12:13:54] (PS14) Vgutierrez: ATS: Support listening on multiple ports [puppet] - https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [12:14:15] pulled onto deploy1001, going to scap pull it to mwmaint1002 [12:15:05] done, addWiki.php completed [12:15:16] going to sync dblists&wikiversions [12:16:23] !log urbanecm@deploy1001 Synchronized dblists: Creating arywiki (T257674) (duration: 00m 57s) [12:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:52] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating arywiki (T257674) [12:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:59] T257674: Create Moroccan Arabic Wikipedia - https://phabricator.wikimedia.org/T257674 [12:19:21] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating arywiki (T257674) (duration: 00m 57s) [12:19:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating arywiki (T257674) (duration: 00m 56s) [12:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:47] (PS15) Vgutierrez: ATS: Support listening on multiple ports [puppet] - https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [12:21:11] Amir1: wiki seems to be up, finishing langlist sync :) [12:21:34] cool, we can rebuild interwiki cache once they all are done [12:21:38] !log urbanecm@deploy1001 Synchronized langlist: Creating arywiki (T257674) (duration: 00m 56s) [12:21:42] sure, will do that [12:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:54] (PS8) Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1) [12:22:00] okay, doing T257672 now [12:22:00] T257672: Create Wikisource Ligurian - https://phabricator.wikimedia.org/T257672 [12:22:08] (CR) Urbanecm: [C: +2] Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1) [12:22:09] o/ [12:22:42] (CR) jerkins-bot: [V: -1] Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1) [12:23:03] :( [12:23:28] hmm...why didn't that fail earlier, fixing [12:24:59] (PS1) Urbanecm: Add arywiki to rtl.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/614733 (https://phabricator.wikimedia.org/T257674) [12:25:19] (CR) Urbanecm: [C: +2] Add arywiki to rtl.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/614733 (https://phabricator.wikimedia.org/T257674) (owner: Urbanecm) [12:26:22] (Merged) 
10jenkins-bot: Add arywiki to rtl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614733 (https://phabricator.wikimedia.org/T257674) (owner: 10Urbanecm) [12:27:08] !log urbanecm@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 00s) [12:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:38] (03PS9) 10Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [12:27:42] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [12:27:53] !log draining restbase2013 for eventual reboot for kernel security update [12:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:11] (03CR) 10JMeybohm: [C: 04-1] Add discovery records for chartmuseum (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [12:28:20] !log urbanecm@deploy1001 Synchronized dblists/rtl.dblist: Add arywiki to rtl.dblist (T257674) (duration: 00m 57s) [12:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:25] T257674: Create Moroccan Arabic Wikipedia - https://phabricator.wikimedia.org/T257674 [12:28:28] (03Merged) 10jenkins-bot: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [12:28:51] pulling lijwikisource onto deploy1001 and mwmaint1002 [12:29:14] running addwiki [12:29:18] (03CR) 10Vgutierrez: [C: 03+2] ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) (owner: 10Vgutierrez) [12:29:23] (03PS1) 10Kormat: backups: Move metadata monitoring to icinga hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) [12:29:50] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [12:30:31] addwiki completed, syncing config now [12:30:51] !log urbanecm@deploy1001 Synchronized dblists: Creating lijwikisource (T257672) (duration: 00m 56s) [12:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:56] T257672: Create Wikisource Ligurian - https://phabricator.wikimedia.org/T257672 [12:31:35] PROBLEM - cassandra-a CQL 10.192.16.82:9042 on restbase2013 is CRITICAL: connect to address 10.192.16.82 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:31:51] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:16] (03PS1) 10Ammarpad: Switch $wgUrlShortenerDomainsWhitelist --> $wgUrlShortenerAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614735 (https://phabricator.wikimedia.org/T255491) [12:32:25] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating lijwikisource (T257672) [12:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:37] (03PS2) 10Kormat: backups: Move metadata monitoring to icinga hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) [12:34:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating lijwikisource (T257672) (duration: 00m 57s) [12:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:36] this has the default logo and doesn't change langlist => going to T256545 [12:34:37] T256545: Create private wiki sysop_itwiki - https://phabricator.wikimedia.org/T256545 [12:35:00] (03PS4) 10Urbanecm: Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) [12:35:11] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) (owner: 10Urbanecm) [12:35:44] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:05] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) (owner: 10Urbanecm) [12:36:05] PROBLEM - cassandra-b CQL 10.192.16.83:9042 on restbase2013 is CRITICAL: connect to address 10.192.16.83 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:37:06] ^ expected [12:37:48] Urbanecm: for the private wiki, has the puppet deployed? [12:37:55] RECOVERY - cassandra-b CQL 10.192.16.83:9042 on restbase2013 is OK: TCP OK - 0.036 second response time on 10.192.16.83 port 9042 https://phabricator.wikimedia.org/T93886 [12:37:59] IMPORTANT: If the wiki is a regular public wiki to appear on Cloud Services - you can continue. If the wiki is private and should not be replicated to Cloud Services DO NOT CONTINUE UNTIL YOU HAVE THE OK FROM OPS/DBAs to check no private data is leaked. 
There are mechanisms in place to prevent that by default, but those should be manually checked. [12:38:14] https://wikitech.wikimedia.org/wiki/Add_a_wiki#IMPORTANT:_For_Private_Wikis [12:38:18] Amir1: yes, DBA's gave me the green light in T257125 [12:38:19] T257125: Prepare and check storage layer for sysop_itwiki - https://phabricator.wikimedia.org/T257125 [12:38:27] coolio [12:38:37] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:38:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:59] (03PS5) 10Urbanecm: Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) [12:39:29] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) (owner: 10Urbanecm) [12:40:10] !log draining restbase2014 for eventual reboot for kernel security update [12:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:40] 10Operations, 10Cloud-Services, 10Traffic, 10cloud-services-team (Kanban): Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10jbond) [12:43:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:29] (03Merged) 10jenkins-bot: Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) (owner: 10Urbanecm) [12:44:37] (03PS12) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) 
[12:44:39] pulling onto deploy1001 and mwmaint1002 [12:45:22] running addwiki script [12:45:52] syncing wikiversions etc [12:46:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:55] !log urbanecm@deploy1001 Synchronized dblists: Creating sysop_itwiki (T256545) (duration: 00m 57s) [12:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:00] T256545: Create private wiki sysop_itwiki - https://phabricator.wikimedia.org/T256545 [12:47:21] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] statsite::instance: fix style violation in define (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/610045 (owner: 10Jbond) [12:48:32] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating sysop_itwiki (T256545) [12:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:31] (03CR) 10Filippo Giunchedi: [C: 03+2] role: fold weblog into centrallog [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [12:49:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating sysop_itwiki (T256545) (duration: 00m 57s) [12:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:47] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating sysop_itwiki (T256545) (duration: 00m 57s) [12:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Very minor inline nitpick, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:52:10] all done, running scap update-interwiki-cache now [12:53:13] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/614737 [12:53:15] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614737 (owner: 10Urbanecm) [12:54:06] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614737 (owner: 10Urbanecm) [12:54:15] (03PS13) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [12:54:27] (03CR) 10Alexandros Kosiaris: Add discovery records for chartmuseum (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [12:55:08] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 59s) [12:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:29] (03PS19) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [12:55:39] Amir1: I think we're all done? [12:55:58] \o/ [12:56:03] Yay [12:56:09] !log Create Daimona Eaytoy at sysop_itwiki (T256545) [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:14] T256545: Create private wiki sysop_itwiki - https://phabricator.wikimedia.org/T256545 [12:56:50] (03PS3) 10Kormat: backups: Move metadata monitoring to icinga hosts. [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) [12:56:53] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:31] (03CR) 10Elukey: Update analytics-in(4|6) filters (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/614702 (owner: 10Elukey) [12:57:35] PROBLEM - Check systemd state on centrallog2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:05] !log draining restbase2015 for eventual reboot for kernel security update [12:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:13] !log creating arywiki (T257674), lijwikisource (T257672), sysop_itwiki (T256545) done [12:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:20] T257674: Create Moroccan Arabic Wikipedia - https://phabricator.wikimedia.org/T257674 [12:59:20] T257672: Create Wikisource Ligurian - https://phabricator.wikimedia.org/T257672 [13:00:43] Amir1: should I create a task for you to do the wikidata thingie for the content projects? [13:01:18] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:58] Nah, I'll run it later today [13:02:06] okay, thanks! 
[13:05:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:25] !log reset broken ifup systemd states on puppetdb* hosts [13:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:39] RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:59] (03PS14) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [13:08:03] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:08:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:08:05] (03PS9) 10Jbond: statsite::instance: fix style violation in define [puppet] - 10https://gerrit.wikimedia.org/r/610045 [13:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:09] (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/610045 (owner: 10Jbond) [13:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] (03PS2) 10Elukey: Update analytics-in(4|6) filters [homer/public] - 10https://gerrit.wikimedia.org/r/614702 [13:09:43] !log draining restbase2016 for eventual reboot for kernel security update [13:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:17] (03PS1) 10Jbond: profile::mariadb::backup::check: add dummy db_password [labs/private] - 10https://gerrit.wikimedia.org/r/614743 (https://phabricator.wikimedia.org/T258045) [13:15:12] (03CR) 10Ema: [C: 03+2] LVS: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:15:47] 10Operations, 10DBA, 10OTRS, 10serviceops: Create a parallel OTRS 
database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10akosiaris) Many thanks! [13:16:15] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:07] (03PS7) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [13:17:38] (03PS1) 10Privacybatm: [POC2 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T257601) [13:17:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] statsite::instance: fix style violation in define [puppet] - 10https://gerrit.wikimedia.org/r/610045 (owner: 10Jbond) [13:18:39] (03PS1) 10Privacybatm: [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T257601) [13:19:04] (03CR) 10jerkins-bot: [V: 04-1] [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [13:19:07] (03PS1) 10Alexandros Kosiaris: otrs: Set otrs1001 as OTRS role [puppet] - 10https://gerrit.wikimedia.org/r/614746 (https://phabricator.wikimedia.org/T187984) [13:19:28] 10Operations, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [13:20:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] profile::mariadb::backup::check: add dummy db_password [labs/private] - 10https://gerrit.wikimedia.org/r/614743 (https://phabricator.wikimedia.org/T258045) (owner: 10Jbond) [13:21:12] 10Operations, 10observability: Setup metrics monitoring for OpenLDAP/corp - https://phabricator.wikimedia.org/T206327 (10fgiunchedi) AFAIK the OIT ldap replica will go away, safe to resolve 
@MoritzMuehlenhoff ? [13:22:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:19] (03CR) 10Jbond: [C: 03+2] statsite::instance: fix style violation in define [puppet] - 10https://gerrit.wikimedia.org/r/610045 (owner: 10Jbond) [13:22:48] (03PS2) 10Privacybatm: [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T257601) [13:23:20] 10Operations, 10observability: Setup metrics monitoring for OpenLDAP/corp - https://phabricator.wikimedia.org/T206327 (10MoritzMuehlenhoff) 05Open→03Declined Yes, we can close this. [13:23:23] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) [13:24:54] !log lvs4007 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:59] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [13:25:17] (03PS17) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [13:25:57] RECOVERY - Check systemd state on puppetdb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:03] (03CR) 10Privacybatm: "This patch has the following problem (AssertionError):" [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [13:27:06] !log draining restbase2017 for eventual reboot for kernel security update [13:27:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:25] (03PS4) 10Kormat: backups: Move metadata monitoring to icinga hosts. [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) [13:27:27] (03PS15) 10Jbond: labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [13:29:57] (03CR) 10Privacybatm: "I understand this has issues with sanit_checks and 'stop_slave' option (which can be resolved). But as I said in the other patch (https://" [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [13:31:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:51] !log lvs400[56] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [13:33:12] (03PS1) 10Kormat: mariadb: Add replication monitoring for zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/614747 (https://phabricator.wikimedia.org/T257816) [13:34:14] (03CR) 10Kormat: "+filippo for observability team." 
[puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [13:34:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] (03PS8) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [13:35:48] (03CR) 10jerkins-bot: [V: 04-1] start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [13:42:05] !log lvs2010 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [13:44:25] !log lvs200[78] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:41] !log lvs5003 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:46] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [13:47:56] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:47:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:57] !log draining restbase2018 for eventual reboot for kernel security update 
[13:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:12] (03CR) 10Filippo Giunchedi: "I was able to build the package as expected on a "builder" host in WMCS under Docker. Building on deneb in production at the moment is mod" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/552486 (https://phabricator.wikimedia.org/T217340) (owner: 10Filippo Giunchedi) [13:50:31] !log lvs500[12] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:04] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [13:57:06] !log lvs3007 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:12] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [13:59:09] (03CR) 10ArielGlenn: "Shellcheck is lying about the quoted array in a for loop. 
I'll remove the quotes anyways to get it to shut up >_<" [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [13:59:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10Joe) a:03Joe [13:59:18] !log lvs300[56] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [13:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] (03PS9) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [14:00:16] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:52] 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Deployment services): "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520 (10fgiunchedi) a:05fgiunchedi→03None Unassigning from me sin... [14:02:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:06] 10Operations, 10Traffic: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10Vgutierrez) [14:06:40] 10Operations, 10observability, 10cloud-services-team (Kanban): Prometheus vs. CPU usage vs. 
hyperthreading - https://phabricator.wikimedia.org/T193272 (10fgiunchedi) 05Open→03Declined Boldly resolving, it seems things are working as intended [14:06:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:18] !log lvs1016 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [14:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:23] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [14:08:34] !log lvs101[34] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015 [14:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:15] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:11:18] (03PS10) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [14:13:42] (03PS1) 10Elukey: Add PTR/AAAA records for db1108 [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) [14:13:44] 10Operations, 10Traffic: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10Vgutierrez) p:05Triage→03Medium [14:14:06] !log draining restbase2019 for eventual reboot for kernel security update [14:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:55] (03CR) 10ArielGlenn: [C: 03+2] start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 
(https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [14:16:55] (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:17:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:46] (03CR) 10Ppchelko: api-gateway: Basic envoy chart WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:18:00] (03CR) 10Jcrespo: "Was the private repo updated already?" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:19:11] (03CR) 10Kormat: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:20:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:41] (03CR) 10jerkins-bot: [V: 04-1] Add PTR/AAAA records for db1108 [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [14:22:17] uff [14:22:49] (03CR) 10Jcrespo: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:23:00] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:23:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:31] (03CR) 10Jcrespo: [C: 03+1] backups: Move metadata monitoring to icinga hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:24:10] (03PS2) 10Elukey: Add PTR/AAAA records for db1108 [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) [14:25:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:26:29] !log draining restbase2020 for eventual reboot for kernel security update [14:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:03] (03CR) 10Filippo Giunchedi: "I think the alerting host is fine to run these checks. Also worth noting that checks will be run from both codfw and eqiad all the time, t" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:29:20] (03CR) 10Filippo Giunchedi: [C: 03+1] backups: Move metadata monitoring to icinga hosts. [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:29:28] (03PS1) 10Mholloway: Update mobileapps to 2020-07-17-172246-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/614754 (https://phabricator.wikimedia.org/T258186) [14:29:48] (03CR) 10Kormat: [C: 03+2] backups: Move metadata monitoring to icinga hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:30:31] (03CR) 10Jcrespo: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:31:07] (03CR) 10Jcrespo: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/614734 (https://phabricator.wikimedia.org/T258045) (owner: 10Kormat) [14:31:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:32:50] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@ff49fdf]: Update mobileapps to 0bf7bafa [14:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:56] (03PS1) 10ArielGlenn: dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) [14:33:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:11] !log starting drain and restart of sessionstore hosts for new kernel [14:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:34:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:35] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@ff49fdf]: Update mobileapps to 0bf7bafa (duration: 03m 50s) [14:36:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:34] PROBLEM - snapshot of s1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:37:34] PROBLEM - snapshot of s8 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:37:34] PROBLEM - dump of m5 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:37:34] PROBLEM - dump of s7 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:37:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:10] PROBLEM - cassandra-a SSL 10.192.32.119:7001 on restbase2020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:38:10] ^kormat: did you checked grants? 
[14:38:55] (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-07-17-172246-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/614754 (https://phabricator.wikimedia.org/T258186) (owner: 10Mholloway) [14:39:12] RECOVERY - cassandra-a SSL 10.192.32.119:7001 on restbase2020 is OK: SSL OK - Certificate restbase2020-a valid until 2021-04-07 15:35:52 +0000 (expires in 261 days) https://phabricator.wikimedia.org/T120662 [14:39:50] PROBLEM - dump of m5 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:39:50] PROBLEM - snapshot of s1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:39:50] PROBLEM - dump of s7 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:39:50] PROBLEM - snapshot of s8 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:40:06] (03Merged) 10jenkins-bot: Update mobileapps to 2020-07-17-172246-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/614754 (https://phabricator.wikimedia.org/T258186) (owner: 10Mholloway) [14:40:35] jynus: known? 
^^^ [14:40:54] volans: kormat deployed, that is why I asked him [14:41:02] (03PS4) 10Herron: prometheus[123]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) [14:41:47] (03PS1) 10Ladsgroup: labs: Rename $wgUrlShortenerDomainsWhitelist to $wgUrlShortenerAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614757 (https://phabricator.wikimedia.org/T255491) [14:41:53] k [14:42:12] PROBLEM - dump of s1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:42:12] PROBLEM - dump of s8 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:42:12] PROBLEM - snapshot of x1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:42:12] PROBLEM - snapshot of s2 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:42:30] (03CR) 10Muehlenhoff: prometheus[123]001 assign role::prometheus, add to prometheus_nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [14:42:51] volans: if he doesn't answer, I can revert [14:42:52] er, crap, no not known [14:43:12] i'll revert [14:43:17] don't worry [14:43:21] will let you research [14:43:39] I only suggested if you weren't around [14:43:59] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:43:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:43:59] ok. 
poking first then [14:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:30] PROBLEM - dump of s8 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:44:30] PROBLEM - dump of s1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:44:30] PROBLEM - snapshot of x1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:44:30] PROBLEM - snapshot of s2 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:44:44] (03CR) 10Herron: prometheus[123]001 assign role::prometheus, add to prometheus_nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [14:44:52] (03PS1) 10Elukey: superset: enable TLS to connect to Mysql [puppet] - 10https://gerrit.wikimedia.org/r/614758 (https://phabricator.wikimedia.org/T257412) [14:44:53] acking the alerts on icinga [14:45:07] ACKNOWLEDGEMENT - dump of m5 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:07] ACKNOWLEDGEMENT - dump of m5 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:07] ACKNOWLEDGEMENT - dump of s1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:07] 
ACKNOWLEDGEMENT - dump of s1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:07] ACKNOWLEDGEMENT - dump of s7 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:07] ACKNOWLEDGEMENT - dump of s7 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:07] ACKNOWLEDGEMENT - dump of s8 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:08] ACKNOWLEDGEMENT - dump of s8 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:08] ACKNOWLEDGEMENT - snapshot of s1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:09] ACKNOWLEDGEMENT - snapshot of s1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:09] ACKNOWLEDGEMENT - snapshot of s2 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:10] ACKNOWLEDGEMENT - snapshot of s2 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:10] ACKNOWLEDGEMENT - snapshot of s8 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:10] better disable than... [14:45:11] ACKNOWLEDGEMENT - snapshot of s8 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:11] ACKNOWLEDGEMENT - snapshot of x1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:12] ACKNOWLEDGEMENT - snapshot of x1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:45:14] (03CR) 10Ladsgroup: [C: 03+2] "Beta cluster was left out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614757 (https://phabricator.wikimedia.org/T255491) (owner: 10Ladsgroup) [14:45:15] ... too late [14:45:28] (03PS5) 10Herron: prometheus[345]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) [14:45:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:46:04] (03PS2) 10Alexandros Kosiaris: otrs: Set otrs1001 as OTRS role [puppet] - 10https://gerrit.wikimedia.org/r/614746 (https://phabricator.wikimedia.org/T187984) [14:46:06] (03PS1) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [14:46:06] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:46:06] ACKNOWLEDGEMENT - dump of es4 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:46:06] ACKNOWLEDGEMENT - dump of s2 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:46:06] ACKNOWLEDGEMENT - dump of x1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:46:06] ACKNOWLEDGEMENT - snapshot of s3 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:46:17] (03Merged) 10jenkins-bot: labs: Rename $wgUrlShortenerDomainsWhitelist to $wgUrlShortenerAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614757 (https://phabricator.wikimedia.org/T255491) (owner: 10Ladsgroup) [14:46:22] (03CR) 10Herron: [C: 03+2] prometheus[345]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [14:47:00] (03CR) 10Elukey: [C: 03+2] superset: enable TLS to connect to Mysql [puppet] - 10https://gerrit.wikimedia.org/r/614758 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [14:47:06] !log draining restbase2021 for eventual reboot for kernel security update [14:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:19] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[14:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:50] rebased that labs patch in deploy1001 ^ [14:48:20] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:48:56] PROBLEM - dump of x1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:48:56] PROBLEM - dump of es4 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:48:56] PROBLEM - snapshot of s3 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:48:56] PROBLEM - dump of s2 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:48:57] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [14:49:18] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:26] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [14:49:26] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[14:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:30] PROBLEM - dump of es5 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:50:30] PROBLEM - dump of zarcillo in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:50:30] PROBLEM - snapshot of s4 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:50:30] PROBLEM - dump of s3 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:51:47] i'm confused - why did acking the backup alerts not stop them from spamming the channel? [14:51:51] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [14:51:51] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [14:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:52:12] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:15] kormat: ack or downtime? 
[14:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:32] !log draining and restarting sessionstore1003 [14:52:33] volans: i ack'd them [14:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:40] PROBLEM - dump of zarcillo in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:52:40] PROBLEM - dump of es5 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:52:40] PROBLEM - dump of s3 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:52:40] PROBLEM - snapshot of s4 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:52:42] so they were already in critical I guess [14:52:46] yes [14:52:54] and already in hard state? [14:53:15] i don't know what that is [14:53:16] icinga has soft until a threshold of failed checks is reached (usually 3) and then it becomes HARD and notifications are sent [14:53:20] PROBLEM - MariaDB Replica SQL: m1 on db2078 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 208.80.154.84-nagios for key PRIMARY on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:53:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:53:38] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:53:47] now m1 broke? [14:53:57] so critical-soft, critical-soft, critical-hard -> notification sent, is the default behaviour [14:54:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:54:03] jynus: can you handle that re: m1? [14:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:05] changeable in puppet via parameters [14:54:11] kormat: did you touch m1 data? [14:54:30] all m1 hosts broke [14:54:31] oh. gah. [14:54:50] let's go to pm [14:54:56] PROBLEM - MariaDB read only m1 on db1080 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:55:00] PROBLEM - MariaDB Replica SQL: m1 on db1117 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:55:08] PROBLEM - snapshot of s5 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:55:08] PROBLEM - dump of m1 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:55:08] PROBLEM - dump of s4 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:55:13] - [14:55:16] PROBLEM - MariaDB read only m1 on db1117 is CRITICAL: Could not connect to localhost:3321 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:55:54] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:56:09] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:56:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:54] !log draining restbase2022 for eventual reboot for kernel security update [14:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:36] PROBLEM - dump of m1 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:57:37] PROBLEM - dump of s4 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:57:38] PROBLEM - snapshot of s5 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:58:53] (03PS1) 10Vgutierrez: vcl: Use synthetic warning for ECDHE-RSA-AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/614763 (https://phabricator.wikimedia.org/T258405) [14:59:57] PROBLEM - MariaDB Replica IO: m1 on db2132 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:00:02] PROBLEM - dump of s5 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:00:02] PROBLEM - dump of m2 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:00:02] PROBLEM - snapshot of s6 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:00:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:50] PROBLEM - MariaDB Replica SQL: m1 on db2132 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:01:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:02:34] PROBLEM - snapshot of s6 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:02:34] PROBLEM - dump of m2 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:02:34] PROBLEM - dump of s5 in eqiad on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:02:49] (03PS1) 10Elukey: profile::analytics::search::airflow: use TLS with mysql [puppet] - 10https://gerrit.wikimedia.org/r/614764 (https://phabricator.wikimedia.org/T257412) [15:03:16] PROBLEM - MariaDB Replica IO: m1 on db1080 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:04:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:04:52] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL 
slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:54] PROBLEM - MariaDB read only m1 on db2132 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:04:58] PROBLEM - snapshot of s7 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:04:58] PROBLEM - dump of m3 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:04:58] PROBLEM - dump of s6 in codfw on icinga1001 is CRITICAL: We could not connect to the backup metadata database https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:05:46] PROBLEM - MariaDB Replica SQL: m1 on db1080 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:08:35] !log draining restbase2023 for eventual reboot for kernel security update [15:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:01] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [15:09:01] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:06] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [15:09:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:26] !log 
draining and restarting sessionstore2001 [15:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:50] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/24003/an-airflow1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/614764 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:09:56] (03PS3) 10Alexandros Kosiaris: otrs: Set otrs1001 as OTRS role [puppet] - 10https://gerrit.wikimedia.org/r/614746 (https://phabricator.wikimedia.org/T187984) [15:11:40] RECOVERY - MariaDB Replica IO: m1 on db1080 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:11:46] RECOVERY - MariaDB read only m1 on db1080 is OK: Version 10.4.13-MariaDB-log, Uptime 1746791s, read_only: False, read_only: True, 733.42 QPS, connection latency: 0.004819s, query latency: 0.000883s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:12:10] RECOVERY - MariaDB read only m1 on db1117 is OK: Version 10.4.13-MariaDB-log, Uptime 1748239s, read_only: True, read_only: True, 65.82 QPS, connection latency: 0.002028s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:12:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:36] RECOVERY - MariaDB Replica SQL: m1 on db1080 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:13:06] RECOVERY - MariaDB Replica IO: m1 on db2132 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:13:26] RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
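The soft/hard behaviour described in the 14:53:16 to 14:54:05 messages above (critical-soft, critical-soft, critical-hard, then a notification) can be sketched as a small state machine. This is only an illustration under stated assumptions, not Icinga's own code: the class name `ServiceCheck`, its fields and the returned strings are hypothetical, the threshold of 3 is the "usually 3" from the chat, and repeat notifications, acknowledgements and downtimes are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ServiceCheck:
    """Hypothetical model of one monitored service (not an Icinga API)."""
    max_check_attempts: int = 3   # assumed threshold; "usually 3" per the chat
    failures: int = 0             # consecutive failed checks so far
    hard_problem: bool = False    # True once the failure has reached HARD state

    def record(self, ok: bool) -> str:
        """Feed one check result and describe what happens."""
        if ok:
            # A recovery is only announced if the problem had become HARD.
            notify = self.hard_problem
            self.failures = 0
            self.hard_problem = False
            return "RECOVERY notification" if notify else "ok, silent"
        self.failures += 1
        if self.hard_problem:
            # Simplification: no repeat notifications are modelled.
            return "critical-hard, already notified"
        if self.failures >= self.max_check_attempts:
            self.hard_problem = True
            return "critical-hard, PROBLEM notification"
        return "critical-soft, silent"


check = ServiceCheck()
events = [check.record(ok) for ok in (False, False, False, False, True)]
for event in events:
    print(event)
```

In this sketch the first two failures stay silent, the third one flips the service to a hard state and sends the PROBLEM message, and the later OK result sends the RECOVERY. Each backup check in the log is a separate service, so each would go through this sequence independently.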
[15:13:28] RECOVERY - MariaDB read only m1 on db2132 is OK: Version 10.4.13-MariaDB-log, Uptime 2708977s, read_only: True, read_only: True, 198.57 QPS, connection latency: 0.002460s, query latency: 0.000509s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:13:34] RECOVERY - MariaDB Replica SQL: m1 on db1117 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:13:37] !log dropping and recreating nagios@localhost users on all m1 servers [15:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:12] RECOVERY - MariaDB Replica SQL: m1 on db2132 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:15:37] (03CR) 10Ebernhardson: [C: 03+1] profile::analytics::search::airflow: use TLS with mysql [puppet] - 10https://gerrit.wikimedia.org/r/614764 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:16:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:57] PROBLEM - cassandra-b SSL 10.192.48.143:7001 on restbase2023 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:17:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [15:17:10] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] !log draining and restarting sessionstore2002 [15:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:59] RECOVERY - cassandra-b SSL 10.192.48.143:7001 on restbase2023 is OK: SSL OK - 
Certificate restbase2023-b valid until 2022-01-15 15:53:14 +0000 (expires in 544 days) https://phabricator.wikimedia.org/T120662 [15:21:01] RECOVERY - MariaDB Replica SQL: m1 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:21:48] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [15:21:49] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:08] 10Operations: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (10CDanis) 05Open→03Resolved [15:26:54] 10Operations, 10observability, 10Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (10CDanis) 05Open→03Resolved [15:33:44] (03PS18) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [15:35:09] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:59] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus5001 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:41:52] (03PS1) 10Kormat: mariadb: Change backup monitoring user to avoid confusion. [puppet] - 10https://gerrit.wikimedia.org/r/614786 [15:43:11] (03PS2) 10Kormat: backups: Change monitoring user for backup metadata. 
[puppet] - 10https://gerrit.wikimedia.org/r/614786 [15:45:35] (03PS3) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) [15:45:43] PROBLEM - Prometheus k8s cache not updating on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops [15:49:40] (03CR) 10Jcrespo: [C: 03+2] backups: Change monitoring user for backup metadata. [puppet] - 10https://gerrit.wikimedia.org/r/614786 (owner: 10Kormat) [15:51:15] (03PS1) 10Jbond: (WIP) use dnsmasq: add configueration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 [15:53:24] PROBLEM - Check systemd state on prometheus4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:58] (03PS2) 10Jbond: (WIP) use dnsmasq: add configueration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 [15:54:29] (03PS1) 10Nray: Max-width layout: Make page container fill viewport if content height is small [skins/Vector] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/614771 (https://phabricator.wikimedia.org/T257518) [15:55:37] (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: use TLS with mysql [puppet] - 10https://gerrit.wikimedia.org/r/614764 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:56:47] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus4001 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:59:52] !log restart airflow-webserver/scheduler to pick up TLS to mysql settings [15:59:52] RECOVERY - dump of m1 in codfw on icinga1001 is OK: Last dump for m1 at codfw (db2078.codfw.wmnet:3321) taken on 2020-07-14 04:21:47 (23 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:59:52] RECOVERY - dump of m1 in eqiad on icinga1001 is OK: Last dump for m1 at eqiad (db1117.eqiad.wmnet:3321) taken on 2020-07-14 02:58:54 (23 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:00] PROBLEM - Check systemd state on prometheus3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:40] RECOVERY - dump of m2 in codfw on icinga1001 is OK: Last dump for m2 at codfw (db2078.codfw.wmnet:3322) taken on 2020-07-14 03:05:42 (428 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:00:40] RECOVERY - snapshot of s6 in codfw on icinga1001 is OK: Last snapshot for s6 at codfw (db2097.codfw.wmnet:3316) taken on 2020-07-20 04:39:35 (516 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:00:40] RECOVERY - dump of s5 in codfw on icinga1001 is OK: Last dump for s5 at codfw (db2099.codfw.wmnet:3315) taken on 2020-07-14 00:00:01 (101 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:22] RECOVERY - dump of s1 in eqiad on icinga1001 is OK: Last dump for s1 at eqiad (db1139.eqiad.wmnet:3311) taken on 2020-07-14 00:52:22 (156 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:22] RECOVERY - dump of s1 in codfw on icinga1001 is OK: Last dump for s1 at codfw (db2097.codfw.wmnet:3311) taken on 2020-07-14 03:29:58 (156 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:22] RECOVERY - dump of s2 in eqiad on icinga1001 is OK: Last dump for s2 at eqiad (db1095.eqiad.wmnet:3312) taken on 2020-07-14 00:00:01 (123 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:22] RECOVERY - dump of m5 in eqiad on icinga1001 is OK: Last dump for m5 at eqiad (db1117.eqiad.wmnet:3325) taken on 2020-07-14 00:00:01 (14 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:22] RECOVERY - dump of s6 in codfw on icinga1001 is OK: Last dump for s6 at codfw (db2097.codfw.wmnet:3316) taken on 2020-07-14 08:09:44 (85 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:22] RECOVERY - dump of m5 in codfw on icinga1001 is OK: Last dump for m5 at codfw (db2078.codfw.wmnet:3325) taken on 2020-07-14 02:49:18 (14 GB) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:22] RECOVERY - dump of s3 in codfw on icinga1001 is OK: Last dump for s3 at codfw (db2098.codfw.wmnet:3313) taken on 2020-07-14 00:51:35 (105 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:23] RECOVERY - dump of m2 in eqiad on icinga1001 is OK: Last dump for m2 at eqiad (db1117.eqiad.wmnet:3322) taken on 2020-07-14 05:55:25 (428 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:23] RECOVERY - dump of s7 in codfw on icinga1001 is OK: Last dump for s7 at codfw (db2100.codfw.wmnet:3317) taken on 2020-07-14 02:44:13 (125 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:24] RECOVERY - dump of m3 in codfw on icinga1001 is OK: Last dump for m3 at codfw (db2078.codfw.wmnet:3323) taken on 2020-07-14 00:00:01 (57 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:24] RECOVERY - dump of s5 in eqiad on icinga1001 is OK: Last dump for s5 at eqiad (db1145.eqiad.wmnet:3315) taken on 2020-07-14 00:55:43 (101 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:25] RECOVERY - dump of s3 in eqiad on icinga1001 is OK: Last dump for s3 at eqiad (db1095.eqiad.wmnet:3313) taken on 2020-07-14 03:28:50 (106 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:25] RECOVERY - dump of s8 in eqiad on icinga1001 is OK: Last dump for s8 at eqiad (db1116.eqiad.wmnet:3318) taken on 2020-07-14 03:29:39 (167 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:26] RECOVERY - dump of s7 in eqiad on icinga1001 is OK: Last dump for s7 at eqiad (db1116.eqiad.wmnet:3317) taken on 2020-07-14 01:20:45 (125 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:26] RECOVERY - dump of s8 in codfw on icinga1001 is OK: Last dump for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2020-07-14 06:15:10 (167 GB) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:27] RECOVERY - dump of x1 in eqiad on icinga1001 is OK: Last dump for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2020-07-14 02:32:18 (31 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:27] RECOVERY - dump of zarcillo in codfw on icinga1001 is OK: Last dump for zarcillo at codfw (db2093.codfw.wmnet) taken on 2020-07-14 00:50:36 (1 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:31] bye bye icinga-wm [16:02:17] oh nice it managed to not get kicked out [16:03:18] RECOVERY - snapshot of s6 in eqiad on icinga1001 is OK: Last snapshot for s6 at eqiad (db1139.eqiad.wmnet:3316) taken on 2020-07-20 04:44:21 (508 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:31] volans: lol [16:03:38] RECOVERY - snapshot of s1 in eqiad on icinga1001 is OK: Last snapshot for s1 at eqiad (db1139.eqiad.wmnet:3311) taken on 2020-07-19 20:32:38 (986 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:38] RECOVERY - snapshot of s3 in eqiad on icinga1001 is OK: Last snapshot for s3 at eqiad (db1095.eqiad.wmnet:3313) taken on 2020-07-20 07:03:58 (925 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:38] RECOVERY - snapshot of s2 in codfw on icinga1001 is OK: Last snapshot for s2 at codfw (db2098.codfw.wmnet:3312) taken on 2020-07-20 01:32:06 (795 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:38] RECOVERY - snapshot of s1 in codfw on icinga1001 is OK: Last snapshot for s1 at codfw (db2097.codfw.wmnet:3311) taken on 2020-07-19 20:36:43 (1010 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:38] RECOVERY - snapshot of s5 in eqiad on icinga1001 is OK: Last snapshot for s5 at eqiad (db1145.eqiad.wmnet:3315) taken on 2020-07-20 03:07:52 (650 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:38] RECOVERY - snapshot of s4 in codfw on 
icinga1001 is OK: Last snapshot for s4 at codfw (db2099.codfw.wmnet:3314) taken on 2020-07-19 23:58:53 (1283 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:38] RECOVERY - snapshot of s4 in eqiad on icinga1001 is OK: Last snapshot for s4 at eqiad (db1145.eqiad.wmnet:3314) taken on 2020-07-19 23:52:46 (1262 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:39] RECOVERY - snapshot of s2 in eqiad on icinga1001 is OK: Last snapshot for s2 at eqiad (db1095.eqiad.wmnet:3312) taken on 2020-07-20 01:11:54 (812 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:39] RECOVERY - snapshot of x1 in codfw on icinga1001 is OK: Last snapshot for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2020-07-18 08:07:45 (218 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:40] RECOVERY - snapshot of x1 in eqiad on icinga1001 is OK: Last snapshot for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2020-07-20 06:54:00 (184 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:40] RECOVERY - snapshot of s8 in codfw on icinga1001 is OK: Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2020-07-19 20:50:16 (1162 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:41] RECOVERY - snapshot of s8 in eqiad on icinga1001 is OK: Last snapshot for s8 at eqiad (db1116.eqiad.wmnet:3318) taken on 2020-07-19 20:48:12 (1108 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:41] RECOVERY - snapshot of s5 in codfw on icinga1001 is OK: Last snapshot for s5 at codfw (db2099.codfw.wmnet:3315) taken on 2020-07-20 03:04:27 (648 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:42] RECOVERY - snapshot of s7 in codfw on icinga1001 is OK: Last snapshot for s7 at codfw (db2100.codfw.wmnet:3317) taken on 2020-07-20 04:28:37 (1000 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:03:42] PROBLEM - Check 
the last execution of generate-mysqld-exporter-config on prometheus3001 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:04:09] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Eevans) >>! In T256863#6315661, @wiki_willy wrote: >>>! In T256863#6315360, @Eevans wrote: >>>>! In T256863#6313866, @wiki_willy wrote: >>> Hi @Eevans - it looks like this was originally scheduled to be re... [16:05:40] PROBLEM - Prometheus k8s cache not updating on prometheus4001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus4001&var-datasource=ulsfo+prometheus/ops [16:07:33] 10Operations, 10Analytics, 10Traffic: Add backend field to webrequest Hive table - https://phabricator.wikimedia.org/T257354 (10Milimetric) p:05Triage→03High We want to do this, but there are a lot of high priority tasks before it, so ping us if it becomes more urgent. [16:08:25] 10Operations, 10SRE-Access-Requests: Request for SSH access to analytics-privatadata-users group - https://phabricator.wikimedia.org/T258413 (10CBogen) [16:09:27] 10Operations, 10SRE-Access-Requests: Request for SSH access to analytics-privatadata-users group - https://phabricator.wikimedia.org/T258413 (10CBogen) Hi @Abit - I'm told I should have my manager comment on the ticket that the request is appropriate and I need access to this data (which I do for analyzing sea... 
[16:13:10] PROBLEM - Prometheus k8s cache not updating on prometheus3001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus3001&var-datasource=esams+prometheus/ops [16:13:50] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (10CGlenn) [16:13:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor inline comment about a default value, otherwise LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [16:14:46] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (10CGlenn) Hi @Joe I updated the ticket. I would like the duration to run until July 2021 [16:19:51] ACKNOWLEDGEMENT - Check systemd state on prometheus3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Kormat New hosts, not fully set up yet. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:51] ACKNOWLEDGEMENT - Check the last execution of generate-mysqld-exporter-config on prometheus3001 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config Kormat New hosts, not fully set up yet. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:19:51] ACKNOWLEDGEMENT - Prometheus k8s cache not updating on prometheus3001 is CRITICAL: instance=127.0.0.1 job=prometheus Kormat New hosts, not fully set up yet. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus3001&var-datasource=esams+prometheus/ops [16:19:51] ACKNOWLEDGEMENT - Check systemd state on prometheus4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Kormat New hosts, not fully set up yet. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:51] ACKNOWLEDGEMENT - Check the last execution of generate-mysqld-exporter-config on prometheus4001 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config Kormat New hosts, not fully set up yet. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:19:51] ACKNOWLEDGEMENT - Prometheus k8s cache not updating on prometheus4001 is CRITICAL: instance=127.0.0.1 job=prometheus Kormat New hosts, not fully set up yet. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus4001&var-datasource=ulsfo+prometheus/ops [16:19:51] ACKNOWLEDGEMENT - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Kormat New hosts, not fully set up yet. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:52] ACKNOWLEDGEMENT - Check the last execution of generate-mysqld-exporter-config on prometheus5001 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config Kormat New hosts, not fully set up yet. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:19:52] ACKNOWLEDGEMENT - Prometheus k8s cache not updating on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus Kormat New hosts, not fully set up yet. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops [16:20:13] (03PS16) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [16:20:38] (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [16:21:17] 10Operations, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10Kormat) The new prometheus hosts have started alerting today. I've acked the current alerts until you folks have time to look into it. The `generate-mysqld-exporter... [16:25:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] Change hostname of `mobileapps` to the dockerized instance. [puppet] - 10https://gerrit.wikimedia.org/r/613144 (https://phabricator.wikimedia.org/T256794) (owner: 10Jgiannelos) [16:27:25] !log akosiaris@cumin1001 conftool action : set/weight=8; selector: dc=codfw,service=mobileapps,name=scb.* [16:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:42] !log increase codfw mobileapps kubernetes traffic to 25% T218733. 
Take #2 [16:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:47] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [16:31:45] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@10afb4b]: airflow: Turn off catchup on cirrus_namespace_map [16:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:10] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@10afb4b]: airflow: Turn off catchup on cirrus_namespace_map (duration: 00m 25s) [16:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:16] 10Operations, 10SRE-Access-Requests: Request for SSH access to analytics-privatadata-users group - https://phabricator.wikimedia.org/T258413 (10Abit) I hereby comment that @CBogen's request is appropriate and she needs access to this data to analyze search data for SDAW metrics and MediaSearch feature developm... [16:35:30] (03PS2) 10Effie Mouzeli: role::parsoid: Add missing exporters for parsoid [puppet] - 10https://gerrit.wikimedia.org/r/613307 [16:41:46] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:13] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move yarn.wikimedia.org to a separate Buster VM - https://phabricator.wikimedia.org/T258152 (10Milimetric) [16:45:04] (03CR) 10Andrew Bogott: "thanks for the patch! I'm having a hard time figuring out where role::osm::master is applied; do you know where it's used?" 
[puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [16:50:38] 10Operations, 10Wiki-Setup (Delete / Redirect): Merge or delete grantswiki - https://phabricator.wikimedia.org/T229950 (10Meno25) [16:50:51] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr @jclark-ctr - it doesn't appear to be a drive issue. Do you have any more BBUs around? >>! In T257949#6318336, @fgiunchedi wrote: > This appears to be a BBU fault: >... [16:58:19] (03PS2) 10Ryan Kemper: cirrussearch: Allow 2 dewiki->content shards/node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 [17:00:04] gehel and onimisionipe: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T1700). [17:15:27] (03PS1) 10Ammarpad: UrlShortener: Remove config renaming hack [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/614772 (https://phabricator.wikimedia.org/T255491) [17:21:25] (03Abandoned) 10Nray: Max-width layout: Make page container fill viewport if content height is small [skins/Vector] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/614771 (https://phabricator.wikimedia.org/T257518) (owner: 10Nray) [17:29:39] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [17:34:23] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [17:38:56] 10Operations, 10Release-Engineering-Team, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Nuria) pin here cause @calbon also needs kerberos credentials 
[17:43:19] (03PS1) 10CDanis: admin: add kerberos for chrisalbon [puppet] - 10https://gerrit.wikimedia.org/r/614798 (https://phabricator.wikimedia.org/T256412) [17:44:18] (03CR) 10CDanis: [C: 03+2] admin: add kerberos for chrisalbon [puppet] - 10https://gerrit.wikimedia.org/r/614798 (https://phabricator.wikimedia.org/T256412) (owner: 10CDanis) [17:44:47] 10Operations, 10Release-Engineering-Team, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10CDanis) 05Open→03Resolved Email with temporary Kerberos password sent. [17:51:06] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Ejegg) OK, the settings change is deployed and tested to work (at least on Firefox). As @Pcoombe... [17:52:22] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "Shipping this (concurrently updating the "real" private repo)" [labs/private] - 10https://gerrit.wikimedia.org/r/610705 (https://phabricator.wikimedia.org/T254646) (owner: 10Ryan Kemper) [17:54:42] (03CR) 10Dwisehaupt: [C: 03+1] "Looks good. Shipit." (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/612472 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T1800). [18:00:05] dont|panic, Ammarpad, and ryankemper: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:21] I'm here today :P [18:10:03] (03PS1) 10BryanDavis: toolforge: Add missing k8s-status tool to legacy_redirector [puppet] - 10https://gerrit.wikimedia.org/r/614820 [18:12:34] (03CR) 10BryanDavis: "I haven't done any sort of comprehensive audit, but I noticed that my browser's memory of https://tools.wmflabs.org/k8s-status/ was redire" [puppet] - 10https://gerrit.wikimedia.org/r/614820 (owner: 10BryanDavis) [18:19:19] I can deploy today! [18:19:31] hooray! [18:19:33] hi dont|panic ! [18:19:44] (03PS2) 10Urbanecm: Change of 'rollbacker' group settings at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339) (owner: 10Tks4Fish) [18:19:45] hey Urbanecm! [18:19:58] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339) (owner: 10Tks4Fish) [18:20:21] (03PS17) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [18:20:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:49] (03Merged) 10jenkins-bot: Change of 'rollbacker' group settings at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339) (owner: 10Tks4Fish) [18:20:50] (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [18:21:09] (03PS2) 10Urbanecm: Adding 'rollbacker' group for arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100) (owner: 
10Tks4Fish) [18:21:43] dont|panic: ready at mwdebug1001 :) [18:21:52] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100) (owner: 10Tks4Fish) [18:21:55] (03CR) 10Jeena Huneidi: Kask: Use Releng Cassandra Image (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [18:22:11] jawiki ok, checking arzwiki [18:22:22] (03PS2) 10ArielGlenn: dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) [18:22:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:22:38] arzwiki is not there yet dont|panic :) [18:22:42] oh [18:22:43] (03Merged) 10jenkins-bot: Adding 'rollbacker' group for arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100) (owner: 10Tks4Fish) [18:22:51] I was going to say it wasn't :P [18:23:04] (03PS10) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [18:23:08] sorry, I should be more clear :) [18:23:28] syncing jawiki [18:24:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ee7ac95e16f55e850b318f7354842795e08e0270: Change of rollbacker group settings at jawiki (T258339) (duration: 00m 57s) [18:24:24] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11473 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [18:24:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:29] T258339: Changing rollbacker settings on jawiki - https://phabricator.wikimedia.org/T258339
[18:24:37] dont|panic: jawiki done
[18:24:41] ty
[18:24:55] dont|panic: what about arzwiki? :-)
[18:25:15] all good too :)
[18:25:19] syncing :)
[18:25:34] (PS11) Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041)
[18:26:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: dfed4727c6f9e003f9e1949b2995a0cf0ad4f1cc: Adding rollbacker group for arzwiki (T258100) (duration: 00m 57s)
[18:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:44] T258100: Creation of Rollbacker group on arz.wikipedia.org - https://phabricator.wikimedia.org/T258100
[18:26:52] dont|panic: arzwiki done too :)
[18:26:59] thanks a lot man :D
[18:27:01] Ammarpad: hey, you here? :)
[18:27:03] dont|panic: no problem!
[18:27:37] @Urbanecm Yes
[18:28:10] Ammarpad: I'll merge your patch, could you do a similar change for labs as well? Am.ir already pushed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/614757 out :-)
[18:28:16] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/614735 (https://phabricator.wikimedia.org/T255491) (owner: Ammarpad)
[18:29:02] (Merged) jenkins-bot: Switch $wgUrlShortenerDomainsWhitelist --> $wgUrlShortenerAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/614735 (https://phabricator.wikimedia.org/T255491) (owner: Ammarpad)
[18:30:10] @Urbanecm working on the beta config
[18:30:25] Ammarpad: thanks. Please also check the production one at mwdebug1001 :)
[18:32:00] (PS1) Ammarpad: Labs: Switch $wgUrlShortenerDomainsWhitelist --> $wgUrlShortenerAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/614825 (https://phabricator.wikimedia.org/T255491)
[18:34:57] (PS1) ArielGlenn: rename the main rsyncer script i prep for script that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856)
[18:35:42] @Urbanecm done, looks fine
[18:35:48] thanks, syncing
[18:36:02] (CR) Effie Mouzeli: [C: +1] mediawiki: also include the MW apache2.conf on jobrunners [puppet] - https://gerrit.wikimedia.org/r/599683 (https://phabricator.wikimedia.org/T190111) (owner: Dzahn)
[18:36:25] (CR) Urbanecm: [C: +2] "noop for prod, labs-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/614825 (https://phabricator.wikimedia.org/T255491) (owner: Ammarpad)
[18:37:12] (Merged) jenkins-bot: Labs: Switch $wgUrlShortenerDomainsWhitelist --> $wgUrlShortenerAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/614825 (https://phabricator.wikimedia.org/T255491) (owner: Ammarpad)
[18:37:13] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: df2584f181f08da0e1191f97e619e912e587b48d: Switch $wgUrlShortenerDomainsWhitelist --> $wgUrlShortenerAllowedDomains (T255491) (duration: 00m 57s)
[18:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:18] T255491: Replace usages of Whitelist/Blacklist in UrlShortener - https://phabricator.wikimedia.org/T255491
[18:37:20] Ammarpad: done :)
[18:37:30] the beta patch will arrive to beta within 30 minutes
[18:38:07] ryankemper: hello, do you plan to self-service? :-)
[18:38:12] (ie. deploy the patch yourself)
[18:38:57] @Urbanecm OK, thanks
[18:40:03] Urbanecm: I can if necessary, yeah
[18:40:22] would be best if you can do it, I have zero knowledge about what it does
[18:40:33] over to you then :)
[18:40:58] Ah well the patch specifically doesn’t require any testing so it
[18:41:06] it’s just the normal process*
[18:42:13] I'm just not comfortable deploying patches when I don't know what they do :)
[18:42:32] Understood
[18:42:53] Will deploy in ~10 mins
[18:45:42] (PS2) ArielGlenn: rename the current rsyncer script preparing for script that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856)
[18:47:16] (PS3) ArielGlenn: rename the current rsyncer script preparing for script that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856)
[18:52:28] (PS2) Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (group2; all wikis) [mediawiki-config] - https://gerrit.wikimedia.org/r/610149
[18:52:34] (Abandoned) Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (group2; all wikis) [mediawiki-config] - https://gerrit.wikimedia.org/r/610149 (owner: Krinkle)
[18:55:13] (PS1) CRusnov: puppetdb microservice: Fix host query [puppet] - https://gerrit.wikimedia.org/r/614827
[18:56:03] (CR) CRusnov: "This change is ready for review."
[puppet] - 10https://gerrit.wikimedia.org/r/614827 (owner: 10CRusnov) [18:57:36] (03PS1) 10Vidhi-Mody: Test: Verfiy pushing a patch to Gerrit [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614828 [18:57:38] (03PS1) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) [19:01:31] 10Operations, 10LDAP-Access-Requests: Add nskaggs to WMF ldap group - https://phabricator.wikimedia.org/T258437 (10nskaggs) [19:02:16] 10Operations, 10LDAP-Access-Requests: Add nskaggs to WMF ldap group - https://phabricator.wikimedia.org/T258437 (10nskaggs) [19:03:13] 10Operations, 10LDAP-Access-Requests: Add nskaggs to WMF ldap group - https://phabricator.wikimedia.org/T258437 (10nskaggs) [19:03:21] (03CR) 10Andrew Bogott: [C: 04-1] "If this replaces modules/wmflib/lib/hiera/backend/httpyaml_backend.rb in all cases I'd expect the same patch to remove that file." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613112 (owner: 10Jbond) [19:03:59] !log mforns@deploy1001 Started deploy [analytics/refinery@af86a05]: Regular analytics weekly train [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2] [19:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:55] (03PS1) 10MusikAnimal: Enable on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) [19:09:45] !log mforns@deploy1001 Finished deploy [analytics/refinery@af86a05]: Regular analytics weekly train [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2] (duration: 05m 46s) [19:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:34] !log mforns@deploy1001 Started deploy [analytics/refinery@af86a05] (thin): Regular analytics weekly train THIN [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2] [19:10:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:41] (CR) DannyS712: [C: -1] "Commit message is unclear" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) (owner: MusikAnimal) [19:10:42] !log mforns@deploy1001 Finished deploy [analytics/refinery@af86a05] (thin): Regular analytics weekly train THIN [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2] (duration: 00m 07s) [19:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:25] (PS2) MusikAnimal: Enable $wgWatchlistExpiry on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) [19:12:32] (CR) CRusnov: [C: +2] "Self merging as it is a harmless tested change that at worst will break netbox reports in order to unblock myself on import work." [puppet] - https://gerrit.wikimedia.org/r/614827 (owner: CRusnov) [19:13:08] (CR) MusikAnimal: "> Patch Set 1: Code-Review-1" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) (owner: MusikAnimal) [19:14:22] (PS1) Jdlrobson: Enable Vector opt in preference everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/614837 (https://phabricator.wikimedia.org/T254228) [19:18:43] (CR) DannyS712: [C: +1] "Looking forward to testing it out" [mediawiki-config] - https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) (owner: MusikAnimal) [19:20:29] (PS3) Effie Mouzeli: role::parsoid: Add missing exporters for parsoid [puppet] - https://gerrit.wikimedia.org/r/613307 [19:20:38] (PS3) Ryan Kemper: cirrussearch: Allow 2 dewiki->content shards/node [mediawiki-config] - https://gerrit.wikimedia.org/r/613669 [19:21:29] (PS1) Bstorm: icinga: add nskaggs to icinga contactgroups [puppet] - https://gerrit.wikimedia.org/r/614838 
(https://phabricator.wikimedia.org/T255220) [19:22:59] Operations, ops-codfw, RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (Papaul) @Eevans @akosiaris we have 2 spare on site that we can use to replace this server wmf6413 and wmf6414 in netbox both servers are: HP ProLiant DL360 Gen9 with 64GB RAM, Intel(R) Xeon(R) CPU E5-263... [19:23:15] (PS1) ArielGlenn: script for rsyncing dumps via secondary storage server [puppet] - https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856) [19:23:37] I've got a question regarding step 6 of https://wikitech.wikimedia.org/wiki/Backport_windows#Doing_the_deploy: [19:23:43] > If there are no errors and the fix seems to work (if testable in that manner), then the backport team member deploys the patch to the entire fleet [19:23:52] What's the process for deploying the patch to the entire fleet? [19:26:23] i.e. by that point a `scap pull` has been performed on `mwdebug002`, so is rolling it fleet-wide a matter of running `scap pull` on `deploy1001`? Or is it just a matter of git pulling the latest master? [19:28:47] (PS12) Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [19:32:22] ryankemper: no it's running scap sync-file wmf-config/InitialiseSettings.php "Phab Ticket: log message" from deploy1001:/srv/mediawiki-staging and monitoring logstash for possible errors [19:34:07] Operations, Release-Engineering-Team, Patch-For-Review, Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (Halfak) Resolved→Open Still waiting on deployment-prep access so that @calbon can do a beta deploy of ORES. [19:35:19] Operations, ops-eqiad, DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Cmjohnson) {F31942312}. 
@Jclark-ctr TSR report is attached [19:36:57] ryankemper: https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers is more detailed [19:37:09] Operations, SRE-Access-Requests: Requesting access to WMCS for USER[S] - https://phabricator.wikimedia.org/T258438 (nskaggs) [19:37:13] dcausse: super helpful, thanks [19:37:20] Operations, SRE-Access-Requests: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (nskaggs) [19:43:28] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Ladsgroup) I, in my volunteer capacity and out of frustration, requested bot flag of two bots that eac... [19:50:48] (PS1) Nskaggs: Add nskaggs key and grant access to WMCS related groups [puppet] - https://gerrit.wikimedia.org/r/614847 [19:50:51] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/614847 (owner: Nskaggs) [19:57:22] (PS2) Nskaggs: Add nskaggs key and grant access to WMCS related groups [puppet] - https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) [19:58:19] (CR) jerkins-bot: [V: -1] Add nskaggs key and grant access to WMCS related groups [puppet] - https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: Nskaggs) [19:59:50] (PS3) Nskaggs: Add nskaggs key and grant access to WMCS related groups [puppet] - https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) [20:00:02] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:05] halfak and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T2000). [20:05:38] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:47] (CR) Nskaggs: [C: +1] icinga: add nskaggs to icinga contactgroups [puppet] - https://gerrit.wikimedia.org/r/614838 (https://phabricator.wikimedia.org/T255220) (owner: Bstorm) [20:06:52] Operations, Release-Engineering-Team, Patch-For-Review, Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (thcipriani) [20:06:57] (CR) Herron: "> I investigated how the callout verification method works to see if we could just move otrs under the gsuite router however as far as i c" [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [20:07:18] Operations, Release-Engineering-Team, Patch-For-Review, Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (thcipriani) Open→Resolved >>! In T256412#6320739, @Halfak wrote: > Still waiting on deployment-prep access so that... 
[20:07:49] (CR) Effie Mouzeli: [C: +1] mediawiki::maintenance: load mod_security2 also on mwmaint*, not just mw* [puppet] - https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629) (owner: Dzahn) [20:08:15] (CR) Bstorm: [C: +2] icinga: add nskaggs to icinga contactgroups [puppet] - https://gerrit.wikimedia.org/r/614838 (https://phabricator.wikimedia.org/T255220) (owner: Bstorm) [20:08:33] Operations, Performance-Team, serviceops, Patch-For-Review, Sustainability (Incident Prevention): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Krinkle) [20:11:16] Operations, Performance-Team, serviceops, Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle) a:aaron→Krinkle Todo for me: Update docs on . [20:16:54] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (Aklapper) > Name of approving party (hiring manager for WMF staff): Birgit Mueller Heads-up to @bmueller for sign-off [20:20:22] Operations, ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (Jclark-ctr) @wiki_willy @fgiunchedi we do have 3 bbu remaining on site [20:29:50] (CR) DannyS712: "Why does this need to be cherry picked?" [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614772 (https://phabricator.wikimedia.org/T255491) (owner: Ammarpad) [20:48:30] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (Peachey88) [21:00:04] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T2100). [21:05:05] * sbassett plans to deploy a minor update to an existing PS.php code block in a minute [21:15:46] !log Revised mitigation deployed for T257687 [21:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:20] PROBLEM - SSH on an-worker1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:26:52] RECOVERY - SSH on an-worker1079 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:50:27] (PS1) BryanDavis: wmcs: Add redirector site for wmcloud.org and www.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/614882 (https://phabricator.wikimedia.org/T258415) [21:56:39] (CR) BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1003/24006/" [puppet] - https://gerrit.wikimedia.org/r/614882 (https://phabricator.wikimedia.org/T258415) (owner: BryanDavis) [21:58:26] (PS2) BryanDavis: wmcs: Add redirector site for wmcloud.org and www.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/614882 (https://phabricator.wikimedia.org/T258415) [22:00:30] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup) [22:02:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:04:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:04:40] (CR) Ladsgroup: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup) [22:06:48] (CR) BryanDavis: "PCC with the original resource names is easier to reason about: https://puppet-compiler.wmflabs.org/compiler1002/24007/" [puppet] - https://gerrit.wikimedia.org/r/614882 (https://phabricator.wikimedia.org/T258415) (owner: BryanDavis) [22:08:51] Operations, ops-eqiad, DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (Jclark-ctr) [22:09:18] Operations, ops-eqiad, DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (Jclark-ctr) a:Jclark-ctr→Cmjohnson name rack position asset_tag switchport an-worker1096 A2 39 WMF4839 27 an-worker1097 B4 36 WMF4840 47 an-worker1098 B7... 
[22:09:44] (PS1) Jdlrobson: Enable side bar instrumentation at 20% for all test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) [22:09:46] (PS1) Jdlrobson: Enable desktop improvements for anons (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614889 [22:09:48] (PS1) Jdlrobson: Switch test wikis to new version of vector by default (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614890 (https://phabricator.wikimedia.org/T254227) [22:09:50] (PS1) Jdlrobson: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) [22:10:00] (CR) BryanDavis: wmcs: Add redirector site for wmcloud.org and www.wmcloud.org (2 comments) [puppet] - https://gerrit.wikimedia.org/r/614882 (https://phabricator.wikimedia.org/T258415) (owner: BryanDavis) [22:10:08] (PS1) Bstorm: shinken: add Nicholas Skaggs to shinken contacts [puppet] - https://gerrit.wikimedia.org/r/614892 (https://phabricator.wikimedia.org/T255220) [22:15:13] Operations, ops-eqiad, DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Jclark-ctr) Confirmed: Service Request 1030121866 was successfully submitted. [22:16:23] Operations, ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (Jclark-ctr) @fgiunchedi if you would like bbu swapped i am available tomorrow morning eastern time [22:22:57] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (MBeat33) Thank you @Ejegg! One question, for donors who don't enable cookies in their browsers w... 
[22:28:53] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission db1097.eqiad.wmnet - https://phabricator.wikimedia.org/T257406 (Jclark-ctr) [22:31:25] (CR) Bstorm: [C: +2] shinken: add Nicholas Skaggs to shinken contacts [puppet] - https://gerrit.wikimedia.org/r/614892 (https://phabricator.wikimedia.org/T255220) (owner: Bstorm) [22:32:20] (PS1) Effie Mouzeli: hosts: assign puppet role for rdb2007,rdb2008 [puppet] - https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) [22:34:42] Operations, ops-eqiad, DC-Ops, decommission-hardware: Decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (Jclark-ctr) [22:35:10] Operations, ops-eqiad, DC-Ops, decommission-hardware: Decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (Jclark-ctr) [22:36:15] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Jclark-ctr) [22:38:50] (PS1) Effie Mouzeli: hosts: make rdb2007, rdb2008 a redis cluster [puppet] - https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) [22:40:38] I'm gearing up to self-deploy https://gerrit.wikimedia.org/r/c/613669/ - it was going to go out with this morning's backport window but I couldn't get to it until now [22:41:21] I'd like to deploy it now (before the evening window), but not sure if there's a problem with me deploying it outside the window (the evening window starts in ~19 minutes) [22:41:35] (This is a `mediawiki-config` change if that context helps) [22:43:01] ryankemper: greg-g and thcipriani are usually the folks to ask for an out-of-order deploy [22:43:24] just for the "ok" if you aren't so BOLD as to jump the queue :) [22:44:30] Thanks [22:44:33] I can confirm I'm not that bold :P [22:45:05] Probably might as well just wait the 15 mins at this point though to make things easier 
[22:45:14] jouncebot: next [22:45:15] In 0 hour(s) and 14 minute(s): Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T2300) [22:45:41] ryankemper: fine with me if you're fine deploying and watching [22:45:59] Great [22:46:21] Yeah, this is a simple change so I'm comfortable driving it [22:48:20] (CR) Ryan Kemper: [C: +2] cirrussearch: Allow 2 dewiki->content shards/node [mediawiki-config] - https://gerrit.wikimedia.org/r/613669 (owner: Ryan Kemper) [22:49:20] (PS1) Dwisehaupt: Add icinga check for recurring contributions in a processing state [puppet] - https://gerrit.wikimedia.org/r/614898 (https://phabricator.wikimedia.org/T258013) [22:49:26] (Merged) jenkins-bot: cirrussearch: Allow 2 dewiki->content shards/node [mediawiki-config] - https://gerrit.wikimedia.org/r/613669 (owner: Ryan Kemper) [22:52:11] Patch is merged, working on wrangling the git repo into the right state on `deployment.eqiad.wmnet` [22:54:34] Fetched and `git log -p HEAD..@{u}` showed the expected patch, did `git rebase` and now `HEAD` is where it should be [22:54:48] perfect [22:55:08] Running `scap pull` on `mwdebug1002` [22:55:59] (PS1) Urbanecm: Add English aliases for WS-specific namespaces to lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/614900 (https://phabricator.wikimedia.org/T257672) [22:56:00] At this point there's nothing on my end to test, glancing at logs to make sure nothing's on fire before proceeding to the rest of the fleet [22:57:46] (PS1) Urbanecm: Add ProofreadPage namespace translation for lij [extensions/ProofreadPage] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614775 (https://phabricator.wikimedia.org/T257672) [22:58:14] Everything looks good, proceeding to fleet [22:58:21] `scap sync-file wmf-config/InitialiseSettings.php "613669: cirrussearch: Allow 2 dewiki->content shards/node | 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/613669"` from `deploy1001:/srv/mediawiki-staging` [22:59:49] !log ryankemper@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 613669: cirrussearch: Allow 2 dewiki->content shards/node | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/613669 (duration: 00m 57s) [22:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200720T2300). [23:00:35] ryankemper: if you're done, can I deploy some more patches? [23:00:44] I'm all done here, go ahead Urbanecm [23:00:48] thanks [23:01:10] (PS2) Urbanecm: Add English aliases for WS-specific namespaces to lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/614900 (https://phabricator.wikimedia.org/T257672) [23:01:28] (CR) Urbanecm: [C: +2] Add English aliases for WS-specific namespaces to lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/614900 (https://phabricator.wikimedia.org/T257672) (owner: Urbanecm) [23:01:47] (CR) jerkins-bot: [V: -1] Add ProofreadPage namespace translation for lij [extensions/ProofreadPage] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614775 (https://phabricator.wikimedia.org/T257672) (owner: Urbanecm) [23:02:18] (Merged) jenkins-bot: Add English aliases for WS-specific namespaces to lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/614900 (https://phabricator.wikimedia.org/T257672) (owner: Urbanecm) [23:02:49] (CR) Urbanecm: "recheck" [extensions/ProofreadPage] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614775 (https://phabricator.wikimedia.org/T257672) (owner: Urbanecm) [23:05:40] !log urbanecm@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: 2147774caaa0819f8b5d71cc16bc021d94677702: Add English aliases for WS-specific namespaces to lijwikisource (T257672) (duration: 00m 57s) [23:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:45] T257672: Create Wikisource Ligurian - https://phabricator.wikimedia.org/T257672 [23:06:50] !log run mwscript namespaceDupes.php --wiki=lijwikisource -- fix (T257672) [23:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:06] (CR) Urbanecm: [V: +2] "apparent false positive" [extensions/ProofreadPage] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614775 (https://phabricator.wikimedia.org/T257672) (owner: Urbanecm) [23:08:13] (CR) Urbanecm: [V: +2 C: +2] Add ProofreadPage namespace translation for lij [extensions/ProofreadPage] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614775 (https://phabricator.wikimedia.org/T257672) (owner: Urbanecm) [23:12:12] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/ProofreadPage/ProofreadPage.namespaces.php: 03ed74f0b9b8f55d01f9112c31f2f6ea17990f9c: Add ProofreadPage namespace translation for lij (T257672) (duration: 00m 57s) [23:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:18] T257672: Create Wikisource Ligurian - https://phabricator.wikimedia.org/T257672 [23:16:36] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (Ejegg) Yep, that link should work @MBeat33 . You can append a language code after a slash if you... 
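Note on the 22:52 to 22:59 backport above: the sequence ryankemper narrates (fetch, inspect `git log -p HEAD..@{u}`, `git rebase`, `scap pull` on the debug host, then `scap sync-file` from `deploy1001:/srv/mediawiki-staging`) can be sketched as a dry-run shell function. This is only an illustration under stated assumptions: the `run` and `backport_steps` helpers and the `DRY_RUN` switch are hypothetical and not part of scap, `git fetch` is an assumed spelling of "Fetched", and the sketch prints the commands instead of executing them because the log runs `scap pull` on a different host than the other steps.

```shell
# Dry-run sketch of the backport steps narrated in the log above.
# Hypothetical helpers: run, backport_steps, DRY_RUN (none of these exist in scap).
set -eu

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# $1 = file to sync, $2 = message recorded in the Server Admin Log
backport_steps() {
    # On the deploy host, in /srv/mediawiki-staging (path quoted in the log).
    run git fetch                 # assumed command behind "Fetched"
    run git log -p 'HEAD..@{u}'   # check that only the expected patch is incoming
    run git rebase
    # The log runs this step on mwdebug1002, not on the deploy host.
    run scap pull
    # After glancing at the logs for errors, sync the file to the fleet.
    run scap sync-file "$1" "$2"
}

backport_steps wmf-config/InitialiseSettings.php \
    "613669: cirrussearch: Allow 2 dewiki->content shards/node"
```

Setting `DRY_RUN=0` would execute the commands for real, which only makes sense on a host where `scap` is installed and, per the 22:43 exchange, inside a deploy window or with an explicit OK for an out-of-order deploy.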
[23:17:52] (PS1) Effie Mouzeli: changeprop/changeprop-jobqueue: swap rdb2003 with rdb2007 [deployment-charts] - https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) [23:22:39] (CR) Effie Mouzeli: [C: -2] "Do not merge before 614894" [deployment-charts] - https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli) [23:24:58] (CR) Effie Mouzeli: [C: -2] "Can be merged after 614894" [deployment-charts] - https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli) [23:27:08] (CR) Effie Mouzeli: [C: -2] "Don't merge before coordinating with service owners" [puppet] - https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) (owner: Effie Mouzeli) [23:30:30] (PS2) Effie Mouzeli: hosts: make rdb2007, rdb2008 a redis cluster [puppet] - https://gerrit.wikimedia.org/r/614897 (https://phabricator.wikimedia.org/T255250) [23:36:59] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (MBeat33) Awesome, thanks for confirming that.