[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210312T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:09:14] jouncebot: now
[00:09:14] For the next 0 hour(s) and 50 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210312T0000)
[00:10:12] (CR) Dzahn: [C: -2] "mw2215 need to be moved away from scap proxy" [puppet] - https://gerrit.wikimedia.org/r/670631 (https://phabricator.wikimedia.org/T277119) (owner: Dzahn)
[00:11:16] (CR) Dzahn: "I was back and forth myself about that "combine scap proxy and mcrouter proxy"-part, but eventually I thought they should not hurt each ot" [puppet] - https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) (owner: Dzahn)
[00:15:57] Urbanecm: it seems the window is empty, no deploy?
[00:16:11] (CR) Jdlrobson: Do not log script errors without file uri (1 comment) [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670879 (https://phabricator.wikimedia.org/T266517) (owner: Jdlrobson)
[00:16:32] (PS1) Dzahn: DHCP: remove mw2243 through mw2250 [puppet] - https://gerrit.wikimedia.org/r/670988
[00:17:53] if there is no deployment I will do stuff that influences scap
[00:19:09] mutante: yeah, nothing scheduled and i'm about to sign off / not planning anything train related, so i believe you're clear.
[00:19:28] brennen: cool, thank you
[00:19:46] removing old hardware, changing which host is scap proxy / canary, codfw only
[00:19:56] (PS2) Cwhite: logstash: rename logEvent exception into error.message [puppet] - https://gerrit.wikimedia.org/r/670986 (https://phabricator.wikimedia.org/T234565)
[00:22:27] (CR) Dzahn: [C: +2] scap: switch proxy for codfw A3 from mw2215 to mw2300 [puppet] - https://gerrit.wikimedia.org/r/670944 (https://phabricator.wikimedia.org/T277119) (owner: Dzahn)
[00:23:26] (PS1) Cwhite: logstash: grok field name out of error.message [puppet] - https://gerrit.wikimedia.org/r/670991 (https://phabricator.wikimedia.org/T234565)
[00:28:47] (PS2) Cwhite: logstash: grok field name out of error.message [puppet] - https://gerrit.wikimedia.org/r/670991 (https://phabricator.wikimedia.org/T234565)
[00:53:14] (PS1) Dzahn: site/conftool: turn mw2274 and mw2276 into canaries [puppet] - https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119)
[00:55:15] (PS2) Dzahn: site/conftool: turn mw2374 and mw2376 into API canaries [puppet] - https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119)
[00:55:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:57:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2215.codfw.wmnet
[00:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:58:40] !log shutting down mw2215
[00:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:34] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (Ladsgroup) So far everything is fine except hyperkitty archiver not being able to archive things. The error says because it fails to make requests internally and yup, it fails like this: `...
[01:20:15] (PS4) Dzahn: site: decom mw2215,mw2216, codfw API canaries [puppet] - https://gerrit.wikimedia.org/r/670631 (https://phabricator.wikimedia.org/T277119)
[01:30:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2215.codfw.wmnet
[01:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:34] (CR) Dzahn: [C: +2] site: decom mw2215,mw2216, codfw API canaries [puppet] - https://gerrit.wikimedia.org/r/670631 (https://phabricator.wikimedia.org/T277119) (owner: Dzahn)
[01:36:54] (PS3) Dzahn: site/conftool: turn mw2374 and mw2376 into API canaries [puppet] - https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119)
[01:53:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:22] (CR) Krinkle: Do not log script errors without file uri (1 comment) [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670879 (https://phabricator.wikimedia.org/T266517) (owner: Jdlrobson)
[02:26:24] (CR) Jdlrobson: Do not log script errors without file uri (2 comments) [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670879 (https://phabricator.wikimedia.org/T266517) (owner: Jdlrobson)
[05:45:44] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (Legoktm) Looks like the same issue as {T190111}. Based on the debugging there, I was able to get it to work with: ` root@mailman-mailman02:~# curl -H "Host: localhost" "http://mailman-mail...
[06:09:32] (PS1) Marostegui: db1088: Remove it from dbctl [puppet] - https://gerrit.wikimedia.org/r/671028 (https://phabricator.wikimedia.org/T276025)
[06:10:37] (CR) Marostegui: [C: +2] db1088: Remove it from dbctl [puppet] - https://gerrit.wikimedia.org/r/671028 (https://phabricator.wikimedia.org/T276025) (owner: Marostegui)
[06:11:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1088 from dbctl T276025', diff saved to https://phabricator.wikimedia.org/P14803 and previous config saved to /var/cache/conftool/dbconfig/20210312-061118-marostegui.json
[06:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:27] T276025: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025
[06:12:45] SRE, Wikimedia-Apache-configuration, Privacy, Security: Apache 2.4 exposes server status page by default? - https://phabricator.wikimedia.org/T113090 (Legoktm) For reference, there is a request in the Debian bug tracker to fix this: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777546 Our c...
[06:13:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 for schema change', diff saved to https://phabricator.wikimedia.org/P14804 and previous config saved to /var/cache/conftool/dbconfig/20210312-061306-marostegui.json
[06:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 10%: Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P14805 and previous config saved to /var/cache/conftool/dbconfig/20210312-062850-root.json
[06:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:43] !log Deploy schema change on s2 codfw master, lag will appear - T276150 T276156
[06:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:51] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150
[06:30:51] T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156
[06:43:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 30%: Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P14806 and previous config saved to /var/cache/conftool/dbconfig/20210312-064353-root.json
[06:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3314 for table checking T276742', diff saved to https://phabricator.wikimedia.org/P14807 and previous config saved to /var/cache/conftool/dbconfig/20210312-065008-marostegui.json
[06:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:15] T276742: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742
[06:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 60%: Repool db1082 after schema change', diff saved
to https://phabricator.wikimedia.org/P14808 and previous config saved to /var/cache/conftool/dbconfig/20210312-065857-root.json
[06:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:17] (PS1) Marostegui: db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/671033 (https://phabricator.wikimedia.org/T276742)
[07:02:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2148 T276742', diff saved to https://phabricator.wikimedia.org/P14809 and previous config saved to /var/cache/conftool/dbconfig/20210312-070219-marostegui.json
[07:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:27] T276742: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742
[07:02:55] (CR) Marostegui: [C: +2] db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/671033 (https://phabricator.wikimedia.org/T276742) (owner: Marostegui)
[07:12:26] (PS1) Marostegui: install_server: Do not reimage db2149 [puppet] - https://gerrit.wikimedia.org/r/671034 (https://phabricator.wikimedia.org/T275633)
[07:13:01] (CR) Marostegui: [C: +2] install_server: Do not reimage db2149 [puppet] - https://gerrit.wikimedia.org/r/671034 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[07:14:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P14810 and previous config saved to /var/cache/conftool/dbconfig/20210312-071400-root.json
[07:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2108 T276742', diff saved to https://phabricator.wikimedia.org/P14811 and previous config saved to /var/cache/conftool/dbconfig/20210312-071628-marostegui.json
[07:16:34] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:35] T276742: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742
[07:16:37] !log Stop mysql on db2108 to clone db2148
[07:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:14] (PS1) Marostegui: db1146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/671035
[07:41:39] (CR) Marostegui: [C: +2] db1146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/671035 (owner: Marostegui)
[07:43:00] (PS3) Muehlenhoff: Assign mw_rc_irc role to irc2001 [puppet] - https://gerrit.wikimedia.org/r/670829 (https://phabricator.wikimedia.org/T224579)
[07:49:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:51] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210312T0800)
[08:00:45] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 23616 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:01:31] !log installing openjpeg2 security updates
[08:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:36] SRE, Wikimedia-Apache-configuration, Developer Productivity, Patch-For-Review, Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners) - https://phabricator.wikimedia.org/T190111 (Legoktm) This is causing an issue with our mailman3...
[08:12:41] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[08:14:51] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[08:16:37] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:20:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet
[08:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:11] (PS1) Ryan Kemper: wdqs: new query-preview for wdqs1009 (test host) [labs/private] - https://gerrit.wikimedia.org/r/671079 (https://phabricator.wikimedia.org/T266470)
[08:29:16] (CR) Gehel: [V: +2 C: +2] wdqs: new query-preview for wdqs1009 (test host) [labs/private] - https://gerrit.wikimedia.org/r/671079 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[08:32:02]
(PS3) Ryan Kemper: wdqs: impl. envoy for wdqs-test [puppet] - https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470)
[08:33:02] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[08:33:50] (CR) Ryan Kemper: [C: +2] wdqs: impl. envoy for wdqs-test [puppet] - https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[08:35:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet
[08:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:17] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host pybal-test2002.codfw.wmnet
[08:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:55] (PS1) Ryan Kemper: Revert "wdqs: impl. envoy for wdqs-test" [puppet] - https://gerrit.wikimedia.org/r/671024
[08:45:45] (CR) Ryan Kemper: [C: +2] Revert "wdqs: impl. envoy for wdqs-test" [puppet] - https://gerrit.wikimedia.org/r/671024 (owner: Ryan Kemper)
[08:47:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2002.codfw.wmnet
[08:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet
[08:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet
[08:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:45] (CR) JMeybohm: [C: +1] "LGTM, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/670960 (https://phabricator.wikimedia.org/T276305) (owner: Ottomata)
[08:59:33] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling
[08:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:41] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847
[08:59:43] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling (duration: 00m 10s)
[08:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:34] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling
[09:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:44] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling (duration: 00m 09s)
[09:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:26] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling
[09:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:32] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847
[09:07:01] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling (duration: 01m 35s)
[09:07:07] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:28] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mx2001.wikimedia.org
[09:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx2001.wikimedia.org
[09:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:27] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:17:31] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:18:01] PROBLEM - Check that envoy is running on wdqs1009 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:21:40] (PS1) Gerrit Patch Uploader: Make zhwikinews administrator obtains the grant/delete right of the cross-wiki importer T273405 [mediawiki-config] - https://gerrit.wikimedia.org/r/671084
[09:21:42] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]."
[mediawiki-config] - https://gerrit.wikimedia.org/r/671084 (owner: Gerrit Patch Uploader)
[09:25:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mx1001.wikimedia.org
[09:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx1001.wikimedia.org
[09:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:17] (PS1) Muehlenhoff: install_server: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671088
[09:43:36] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[09:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:37] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[09:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:30] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[09:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:57] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[09:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:42] SRE, Kubernetes: helm test fails in ci namespace - https://phabricator.wikimedia.org/T277252 (JMeybohm) a:JMeybohm→None
[09:48:53] SRE, Kubernetes: helm test fails in ci namespace - https://phabricator.wikimedia.org/T277252 (JMeybohm) a:JMeybohm The networkpolicy label selector you have defined does not match the labels of the pods you create and so the ingress rules don't get applies: https://gerrit.wikimedia.org/r/plugins/giti...
[09:51:14] (PS1) Muehlenhoff: aptrepo: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671091
[09:53:00] (PS1) Muehlenhoff: apt: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671092
[09:56:54] (PS10) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[09:58:03] (PS1) Muehlenhoff: package_builder: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671093
[10:02:20] (PS1) Muehlenhoff: standard_packages: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671095
[10:03:44] (PS1) Muehlenhoff: debian: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671097
[10:04:52] (CR) jerkins-bot: [V: -1] debian: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671097 (owner: Muehlenhoff)
[10:05:23] (PS1) Muehlenhoff: motd: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671098
[10:08:51] (CR) David Caro: netbox: add NetboxServer class (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[10:09:16] (CR) Ayounsi: "I didn't review the Puppet code, but that's my preferred approach." [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[10:09:35] (PS1) Hashar: gerrit: make voting scores a bit more accessible [puppet] - https://gerrit.wikimedia.org/r/671101 (https://phabricator.wikimedia.org/T256615)
[10:11:57] (PS1) Muehlenhoff: base: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671103
[10:12:34] (PS2) Hashar: gerrit: make voting scores a bit more accessible [puppet] - https://gerrit.wikimedia.org/r/671101 (https://phabricator.wikimedia.org/T256615)
[10:12:47] (CR) Hashar: "That is for a start.
It should theoretically override the Gerrit built-in css rules." [puppet] - https://gerrit.wikimedia.org/r/671101 (https://phabricator.wikimedia.org/T256615) (owner: Hashar)
[10:22:01] hashar: when you get some time, please LMK what you think of https://gerrit.wikimedia.org/r/c/integration/config/+/670782
[10:31:36] (CR) Filippo Giunchedi: [C: +1] "LGTM, can get a PCC run too?" [puppet] - https://gerrit.wikimedia.org/r/670916 (owner: Jbond)
[10:33:35] SRE, netops, cloud-services-team (Kanban): cloudgw eqiad1: review & allocate subnets and VLANs - https://phabricator.wikimedia.org/T277020 (ayounsi) > cloudgw <-> neutron: vlan 1107 (cloud-gw-transport-eqiad) 185.15.56.224/29 (new allocation, not registered on netbox yet) How many IPs do you need in...
[10:34:42] godog: XioNoX:
[10:34:44] sorry
[10:34:45] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) (owner: Cwhite)
[10:34:50] its started again :S
[10:34:59] hahhaa
[10:35:05] SRE, netops, Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (jbond) When drafting the patch in https://gerrit.wikimedia.org/r/c/operations/puppet/+/670917/7 i came up with the following structure ` lang=puppet type Netbase::Service = Struct[{ port => Stdli...
[10:36:39] (PS5) Jbond: prometheus::node_puppet_agent: update requires [puppet] - https://gerrit.wikimedia.org/r/670916
[10:36:45] (CR) Filippo Giunchedi: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/669968 (owner: Cwhite)
[10:36:56] (CR) Zabe: "Isn’t this a duplicate of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/660795/4#message-fb497a797d766c48eeae6c1c52633280" [mediawiki-config] - https://gerrit.wikimedia.org/r/671084 (owner: Gerrit Patch Uploader)
[10:36:58] jbond42: hahaha!
[10:38:12] (CR) Filippo Giunchedi: [C: +1] "Not Phabricator maintainer but +1" [puppet] - https://gerrit.wikimedia.org/r/670951 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[10:38:32] (CR) Filippo Giunchedi: [C: +1] "Not Phabricator maintainer but +1" [puppet] - https://gerrit.wikimedia.org/r/670950 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[10:39:00] (CR) Filippo Giunchedi: [C: +1] logstash: rename logEvent exception into error.message [puppet] - https://gerrit.wikimedia.org/r/670986 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[10:39:21] (CR) Filippo Giunchedi: [C: +1] logstash: grok field name out of error.message [puppet] - https://gerrit.wikimedia.org/r/670991 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[10:44:35] (PS5) Hnowlan: aqs: test cassandra 3.11 on aqs1010 [puppet] - https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119)
[10:45:34] (CR) Arturo Borrero Gonzalez: [C: +1] "I don't know how haproxy works for doing this. But overall the patch makes sense."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) (owner: Bstorm)
[10:45:54] (PS1) Jcrespo: dbbackups: Reduce External store retention now that bacula store it long term [puppet] - https://gerrit.wikimedia.org/r/671112 (https://phabricator.wikimedia.org/T138562)
[10:47:16] (CR) Hnowlan: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28525/console" [puppet] - https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: Hnowlan)
[10:48:05] (PS2) Jcrespo: dbbackups: Reduce External store retention now that bacula is used long term [puppet] - https://gerrit.wikimedia.org/r/671112 (https://phabricator.wikimedia.org/T138562)
[10:48:23] SRE, Wikimedia-Apache-configuration, Developer Productivity, Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners) - https://phabricator.wikimedia.org/T190111 (fgiunchedi) >>! In T190111#6907630, @Legoktm wrote: > This is causing an...
[10:50:13] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:52:40] (CR) Jcrespo: [C: +2] dbbackups: Reduce External store retention now that bacula is used long term [puppet] - https://gerrit.wikimedia.org/r/671112 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[10:52:53] (PS3) Jcrespo: dbbackups: Reduce External store retention now that bacula is used long term [puppet] - https://gerrit.wikimedia.org/r/671112 (https://phabricator.wikimedia.org/T138562)
[10:55:01] (CR) Elukey: [C: +1] "LGTM as test, if it works let's use a separate role!
:)" [puppet] - https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: Hnowlan)
[11:01:02] (PS1) Jcrespo: bacula: Reduce read-write es db backup retention to 60 days [puppet] - https://gerrit.wikimedia.org/r/671114 (https://phabricator.wikimedia.org/T79922)
[11:03:12] SRE, netops, cloud-services-team (Kanban): cloudgw eqiad1: review & allocate subnets and VLANs - https://phabricator.wikimedia.org/T277020 (aborrero) A /30 works, but we cannot have per-interface addresses, which makes the VRRP setup less elegant. That's why a /29 is better. But our puppet code is c...
[11:03:47] (CR) Hnowlan: [V: +1 C: +2] aqs: test cassandra 3.11 on aqs1010 [puppet] - https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: Hnowlan)
[11:04:39] (CR) Jcrespo: "I know it requires manual commands afterwards :-)" [puppet] - https://gerrit.wikimedia.org/r/671114 (https://phabricator.wikimedia.org/T79922) (owner: Jcrespo)
[11:05:11] (CR) Jcrespo: [C: +2] bacula: Reduce read-write es db backup retention to 60 days [puppet] - https://gerrit.wikimedia.org/r/671114 (https://phabricator.wikimedia.org/T79922) (owner: Jcrespo)
[11:05:15] hey jynus - am I okay to merge your backup retention setting changes?
[11:05:18] (PS2) Jcrespo: bacula: Reduce read-write es db backup retention to 60 days [puppet] - https://gerrit.wikimedia.org/r/671114 (https://phabricator.wikimedia.org/T79922)
[11:05:23] ok, hnowlan
[11:06:03] done
[11:08:52] (CR) Alexandros Kosiaris: [C: +1] package_builder: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671093 (owner: Muehlenhoff)
[11:14:31] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:10] (PS1) Muehlenhoff: Fix typo in cumin alias [puppet] - https://gerrit.wikimedia.org/r/671120
[11:16:18] (CR) Hnowlan: "Unfortunately it seems this doesn't work - still seeing the DEPLOY_HEAD behaviour when using `# sudo -u deploy-service /usr/bin/scap deplo" [puppet] - https://gerrit.wikimedia.org/r/670784 (owner: Hnowlan)
[11:18:10] (CR) Muehlenhoff: [C: +2] Fix typo in cumin alias [puppet] - https://gerrit.wikimedia.org/r/671120 (owner: Muehlenhoff)
[11:22:54] !log corrected git_server for logstash-logback-encoder, cassandra/twcs and cassandra/metrics-collector on deploy1002
[11:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:56] (PS1) Vgutierrez: lvs: Set depool_threshold to .8 for upload & text [puppet] - https://gerrit.wikimedia.org/r/671124 (https://phabricator.wikimedia.org/T247888)
[11:43:37] PROBLEM - cassandra-b service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:43:45] PROBLEM - cassandra-a service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:52:18] new host --^
[11:55:38] !log upgrade memcached on mc1022, mc2022
[11:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:07] PROBLEM - traffic_server tls process restarted on cp3051 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3051&var-layer=tls
[12:08:10] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.88:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.88 and port 9042: Connection refused Hnowlan New hosts - testing cassandra 3.11 for AQS https://phabricator.wikimedia.org/T93886
[12:08:10] ACKNOWLEDGEMENT - cassandra-a service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive Hnowlan New hosts - testing cassandra 3.11 for AQS https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:08:10] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.120 and port 9042: Connection refused Hnowlan New hosts - testing cassandra 3.11 for AQS https://phabricator.wikimedia.org/T93886
[12:08:10] ACKNOWLEDGEMENT - cassandra-b service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive Hnowlan New hosts - testing cassandra 3.11 for AQS https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:12:08] !log restart ats-tls on cp3051
[12:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:49] RECOVERY - traffic_server tls process restarted on cp3051 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3051&var-layer=tls
[12:17:42] (CR) Impartial just: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/671084 (owner: Gerrit Patch Uploader)
[12:17:55] (PS1) Muehlenhoff: profile::kerberos::client: Default to use DNS canonicalisation [puppet] - https://gerrit.wikimedia.org/r/671130
(https://phabricator.wikimedia.org/T257412) [12:20:57] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:43] 10SRE, 10netops, 10Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (10jbond) > For now (and as i just realised there are only 4) i think ill do the last option implemented [12:26:26] (03PS11) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [12:27:02] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [12:27:36] (03CR) 10Jbond: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/28523/acmechief2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/670916 (owner: 10Jbond) [12:30:19] (03PS12) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [12:33:41] (03PS1) 10Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [12:36:11] 10SRE, 10Packaging: Copy cassandra packages to buster-wikimedia - https://phabricator.wikimedia.org/T274119 (10hnowlan) Promoted/removed as appropriate: ` hnowlan@apt1001:~$ sudo -i reprepro lsbycomponent cassandra cassandra | 2.1.13 | stretch-wikimedia | main | amd64, i386 cassandra | 2... 
[12:39:13] 10SRE, 10Packaging: Copy cassandra packages to buster-wikimedia - https://phabricator.wikimedia.org/T274119 (10hnowlan) 05Open→03Resolved [12:39:15] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10hnowlan) [12:41:06] (03CR) 10Effie Mouzeli: [C: 04-1] "iirc when we are running the train, a lot of traffic flows through the scap proxies, which could potentially lead to TKOs. While this itse" [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [12:41:54] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:44:33] (03PS4) 10Kosta Harlan: linkrecommendation: Use Envoy for requests to MediaWiki API [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) [12:44:41] (03CR) 10Kosta Harlan: linkrecommendation: Use Envoy for requests to MediaWiki API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [12:45:03] (03PS5) 10Kosta Harlan: linkrecommendation: Use Envoy for requests to MediaWiki API [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) [12:51:19] 10SRE, 10Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (10Ladsgroup) [12:58:50] (03CR) 10Hashar: "I always wondered whether we had some Logstash ingested that would convert our wmf format to nice fields in ElasticSearch ( https://wikite" [puppet] - 10https://gerrit.wikimedia.org/r/670951 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:10:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1170:3312', diff saved to https://phabricator.wikimedia.org/P14814 and previous config 
saved to /var/cache/conftool/dbconfig/20210312-131033-marostegui.json [13:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:53] (03PS1) 10Marostegui: mariadb: Decommission db1088 [puppet] - 10https://gerrit.wikimedia.org/r/671135 (https://phabricator.wikimedia.org/T276025) [13:12:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:14:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1088.eqiad.wmnet [13:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:17:04] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1088 [puppet] - 10https://gerrit.wikimedia.org/r/671135 (https://phabricator.wikimedia.org/T276025) (owner: 10Marostegui) [13:24:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1088.eqiad.wmnet [13:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:00] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 (10Marostegui) [13:25:12] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 (10Marostegui) [13:25:30] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [13:45:05] (03CR) 10Ayounsi: [C: 03+1] "A couple comments, I don't know enough Puppet to review the code though, but the logic looks good to me." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [13:49:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 25%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14815 and previous config saved to /var/cache/conftool/dbconfig/20210312-134940-root.json [13:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:26] (03PS2) 10Vgutierrez: lvs: Set depool_threshold to .8 for upload & text [puppet] - 10https://gerrit.wikimedia.org/r/671124 (https://phabricator.wikimedia.org/T247888) [13:53:26] (03PS1) 10Muehlenhoff: idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) [13:54:33] (03CR) 10jerkins-bot: [V: 04-1] idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [13:57:15] (03PS2) 10Muehlenhoff: idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) [13:58:20] (03CR) 10jerkins-bot: [V: 04-1] idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:01:53] 10SRE, 10netops, 10cloud-services-team (Kanban): cloudgw eqiad1: review & allocate subnets and VLANs - https://phabricator.wikimedia.org/T277020 (10ayounsi) OK, let's be conservative with v4 IPs, I reserved https://netbox.wikimedia.org/ipam/prefixes/393/ and matching vlan https://netbox.wikimedia.org/ipam/vl... 
[14:03:37] (03PS3) 10Muehlenhoff: idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) [14:03:53] (03PS4) 10Jbond: idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:04:02] moritzm: fyi Hosts vs Host ^^ [14:04:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:04:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 50%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14816 and previous config saved to /var/cache/conftool/dbconfig/20210312-140443-root.json [14:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:09] jbond42: ah, sure! thx [14:05:15] np [14:05:32] (03PS5) 10Muehlenhoff: idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) [14:07:45] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10OlyKalinichenkoSpeedAndFunction) Hey, Apologies for that. I've updated the SHH key and this one is unique, and it's placed only locally. Could you... 
[14:07:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:17:49] (03PS6) 10Muehlenhoff: idp: Make the memcached transcoder configurable [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) [14:17:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:18:56] I am not near a computer to look at this :/ [14:19:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:19:30] (03PS13) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [14:19:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 75%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14817 and previous config saved to /var/cache/conftool/dbconfig/20210312-141947-root.json [14:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:11] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [14:22:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:23:47] (03PS1) 10JMeybohm: Aggregate IPPools in codfw and eqiad, enable codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/671144 (https://phabricator.wikimedia.org/T277191) [14:24:40] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28526/console" [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [14:30:46] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/686/" [puppet] - 10https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:32:41] (03PS2) 10Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [14:34:00] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28527/console" [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [14:34:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 100%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14818 and previous config saved to /var/cache/conftool/dbconfig/20210312-143450-root.json [14:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:58] (03PS1) 10Muehlenhoff: Switch idp-test to serial transcoder [puppet] - 10https://gerrit.wikimedia.org/r/671166 (https://phabricator.wikimedia.org/T271684) [14:38:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/671166 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:38:18] (03CR) 10Andrew Bogott: [C: 03+2] cinderutils::ensure: Gracefully handle lvm legacy cases [puppet] - 10https://gerrit.wikimedia.org/r/670961 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [14:42:25] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::instance: replace hiera_include with lookup [puppet] - 10https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [14:42:32] (03PS6) 10Andrew Bogott: profile::wmcs::instance: replace hiera_include with lookup [puppet] - 10https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [14:44:49] (03PS3) 10BBlack: lvs: Set depool_threshold to .8 for upload & text [puppet] - 10https://gerrit.wikimedia.org/r/671124 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [14:45:31] (03CR) 10BBlack: [C: 03+1] lvs: Set depool_threshold to .8 for upload & text [puppet] - 10https://gerrit.wikimedia.org/r/671124 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [14:46:42] uh.. what was wrong with the Bug tag line on my commit message bblack? [14:47:04] oh.. 
the actual task number [14:49:12] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 (10Cmjohnson) a:05wiki_willy→03Cmjohnson [14:50:17] (03PS1) 10Alexandros Kosiaris: admin: Remove staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/671169 [14:50:19] (03PS1) 10Alexandros Kosiaris: admin/: Remove codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/671170 (https://phabricator.wikimedia.org/T277191) [14:50:39] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:29] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28528/" [puppet] - 10https://gerrit.wikimedia.org/r/671166 (https://phabricator.wikimedia.org/T271684) (owner: 10Muehlenhoff) [14:53:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:56:18] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10Cmjohnson) @elukey @Ottomata I would like to do this Monday morning my time around 11am local. 
1600UTC [14:56:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:19] (03PS1) 10JMeybohm: kubernetes codfw: Apply role/hiera to new masters [puppet] - 10https://gerrit.wikimedia.org/r/671171 (https://phabricator.wikimedia.org/T277191) [14:59:43] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 (10Cmjohnson) [14:59:51] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:59:51] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 (10Cmjohnson) 05Open→03Resolved [14:59:56] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [15:02:15] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10elukey) @Cmjohnson perfect, @razzi might be around as well, in case we'll let you to sync and do the work :) [15:02:26] vgutierrez: yeah sorry, just seemed easier to fix it in the GUI commitmsg editor :) [15:03:01] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 232796928 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:04:02] it has been a recent recurrent theme for me, thinking about how sometimes process is much heavier than results in situations (in this case: do I ping someone on IRC and/or loop through a -1 code review step just to fix an obvious minor numeric typo in the bug number in a commit msg?). 
[15:05:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 88680 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:07:49] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 13.38 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [15:09:04] (03PS1) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:14:29] (03CR) 10Andrew Bogott: "I did a before and after check of cloud-wide catalog failures and this didn't break anything :)" [puppet] - 10https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [15:15:07] (03PS2) 10JMeybohm: kubernetes codfw: Apply role/hiera to new masters [puppet] - 10https://gerrit.wikimedia.org/r/671171 (https://phabricator.wikimedia.org/T277191) [15:15:09] (03PS1) 10JMeybohm: kubernetes codfw: Populate new worker hiera keys for k8s update [puppet] - 10https://gerrit.wikimedia.org/r/671174 (https://phabricator.wikimedia.org/T277191) [15:15:22] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.backups: Retry a VM backup 3 times before failing [puppet] - 10https://gerrit.wikimedia.org/r/668097 (https://phabricator.wikimedia.org/T276096) (owner: 10David Caro) [15:16:42] (03PS2) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:16:49] (03CR) 10Cwhite: [C: 03+2] logstash: grok field name out of error.message [puppet] - 10https://gerrit.wikimedia.org/r/670991 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:16:53] (03CR) 10Cwhite: [C: 03+2] logstash: rename logEvent exception into error.message [puppet] - 10https://gerrit.wikimedia.org/r/670986 (https://phabricator.wikimedia.org/T234565) 
(owner: 10Cwhite) [15:21:44] (03PS3) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:22:55] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 (owner: 10Elukey) [15:23:24] (03CR) 10JMeybohm: [C: 03+1] "I very much like this! 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/671169 (owner: 10Alexandros Kosiaris) [15:23:29] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Cmjohnson) updated port locations 1005 28 1006 29 1007 30 1008 31 [15:24:55] (03PS4) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:26:15] (03PS1) 10JMeybohm: kubernetes staging-eqiad: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/671175 (https://phabricator.wikimedia.org/T276305) [15:28:22] (03PS1) 10Ayounsi: Remove servers interface names from switches interfaces descriptions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/671176 (https://phabricator.wikimedia.org/T277006) [15:29:24] (03PS5) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:32:47] (03PS1) 10Daimona Eaytoy: Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671150 [15:34:07] (03PS6) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:35:19] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 (owner: 10Elukey) [15:36:08] (03PS7) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile 
[puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:36:30] (03PS1) 10Muehlenhoff: Add approval for snapshot admins [puppet] - 10https://gerrit.wikimedia.org/r/671178 (https://phabricator.wikimedia.org/T276465) [15:36:41] (03PS2) 10Daimona Eaytoy: Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671150 (https://phabricator.wikimedia.org/T277302) [15:36:52] jouncebot: next [15:36:52] In 16 hour(s) and 23 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210313T0800) [15:38:33] (03PS1) 10Muehlenhoff: Add approcal for swift-roots [puppet] - 10https://gerrit.wikimedia.org/r/671180 (https://phabricator.wikimedia.org/T276465) [15:39:04] (03PS2) 10Muehlenhoff: Add approval for swift-roots [puppet] - 10https://gerrit.wikimedia.org/r/671180 (https://phabricator.wikimedia.org/T276465) [15:39:21] Anybody from the relevant teams (RelEng? SRE?) around to discuss a deployment on friday? 
[15:39:31] https://gerrit.wikimedia.org/r/671150 is the patch [15:41:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:41:27] (03PS1) 10Muehlenhoff: Add approval for graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/671182 (https://phabricator.wikimedia.org/T276465) [15:42:08] (03PS3) 10Bstorm: paws: add a hiera-controlled ip blocklist [puppet] - 10https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) [15:42:26] Daimona: releng more likely, consider posting in their channel too [15:42:32] (03PS2) 10BBlack: wikimedia.org: Add Apple Business Manager TXT record [dns] - 10https://gerrit.wikimedia.org/r/663794 (https://phabricator.wikimedia.org/T274592) (owner: 10Vgutierrez) [15:43:13] {{done}}, ty [15:43:28] (03CR) 10Bstorm: "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [15:43:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:44:16] Majavah: Daimona: I am there [15:44:44] (03PS8) 10Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 [15:46:29] Majavah: Daimona: my recommendation is to attach this as a sub task of https://phabricator.wikimedia.org/T274938 which would raise awareness about the regression [15:46:53] then I guess we want MobileFrontend folks to review the revert. 
And from there we will deploy [15:46:54] Right, I forgot to do that [15:46:59] so I htink the bulk of the work is done [15:47:08] and the rest will happen when american folks show up [15:47:12] (to review the patch) [15:47:29] and my colleague that was running the train this week would be able to take care of the deploy if need be [15:47:32] Good enough, ty [15:47:44] or mobilefrontend team would be able to do the backport deploy by themselves [15:49:05] (03CR) 10Bstorm: [C: 03+2] paws: add a hiera-controlled ip blocklist [puppet] - 10https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [15:50:42] (03PS1) 10Cwhite: logstash: use DATA pattern rather than WORD [puppet] - 10https://gerrit.wikimedia.org/r/671184 [15:52:23] (03PS1) 10Elukey: role::redis::misc::master: increase maxmemory for ORES instances [puppet] - 10https://gerrit.wikimedia.org/r/671186 [15:53:05] Daimona: I think it will follow the usual flow and will get acted on over the next few hours. [15:53:25] Sure, no hurry, I'm here for another couple of hours [15:54:04] Daimona: don't worry, mobile team is subscribed to the task now. 
I am pretty sure they will act on it [15:56:15] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/28535/" [puppet] - 10https://gerrit.wikimedia.org/r/671172 (owner: 10Elukey) [15:56:22] (03CR) 10JMeybohm: [C: 03+1] admin/: Remove codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/671170 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [15:56:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28538/console" [puppet] - 10https://gerrit.wikimedia.org/r/671186 (owner: 10Elukey) [15:58:00] (03CR) 10Cwhite: [C: 03+2] logstash: use DATA pattern rather than WORD [puppet] - 10https://gerrit.wikimedia.org/r/671184 (owner: 10Cwhite) [15:58:03] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [15:58:11] (03PS1) 10Herron: pontoon: add hiera settings for o11y-grafana [puppet] - 10https://gerrit.wikimedia.org/r/671187 [15:58:24] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) 05Open→03Resolved db1162 is back online - updated netbox and resolving the task [16:01:48] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:16] (03PS2) 10Elukey: role::redis::misc::master: increase maxmemory for ORES instances [puppet] - 10https://gerrit.wikimedia.org/r/671186 [16:03:52] (03CR) 10BBlack: [C: 03+2] wikimedia.org: Add Apple Business Manager TXT record [dns] - 10https://gerrit.wikimedia.org/r/663794 (https://phabricator.wikimedia.org/T274592) (owner: 10Vgutierrez) [16:05:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:17] PROBLEM - IPv6 ping to eqiad on 
ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 170 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:06:41] (03CR) 10Hnowlan: "Looks good so far! I'm not sure what more is needed right now, might be worth tagging in someone from serviceops." (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos) [16:07:36] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28540/console" [puppet] - 10https://gerrit.wikimedia.org/r/671186 (owner: 10Elukey) [16:08:20] (03CR) 10Muehlenhoff: "Looks good to me, but not familiar enough for meaningful code review.." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [16:08:51] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10BBlack) 05Open→03Resolved a:03Vgutierrez @bcampbell sorry for the delays, this has repeatedly fallen through the cracks, but it's reviewed + merged now... [16:10:53] 10Puppet, 10Analytics-Radar, 10Cassandra, 10observability, and 2 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10colewhite) [16:10:57] 10SRE, 10Analytics-Clusters, 10CirrusSearch, 10Wikidata, and 3 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (10colewhite) 05Open→03Resolved a:03colewhite prometheus-jmx-exporter 0.15.0 is deployed to our apt repo. 
[16:14:15] 10Puppet, 10SRE, 10Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (10Peachey88) [16:14:53] (03PS4) 10Hnowlan: postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) [16:15:55] (03CR) 10jerkins-bot: [V: 04-1] postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [16:17:21] (03PS5) 10Hnowlan: postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) [16:17:48] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28542/console" [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [16:18:29] (03CR) 10Elukey: [V: 03+1 C: 04-1] "interesting, on rdb2xxx I see:" [puppet] - 10https://gerrit.wikimedia.org/r/671186 (owner: 10Elukey) [16:20:16] (03PS3) 10Elukey: role::redis::misc::master: increase maxmemory for ORES instances [puppet] - 10https://gerrit.wikimedia.org/r/671186 [16:23:08] (03CR) 10Hnowlan: [C: 03+2] postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [16:25:03] (03CR) 10Elukey: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/28544/" [puppet] - 10https://gerrit.wikimedia.org/r/671186 (owner: 10Elukey) [16:29:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:33:31] (03PS1) 10Bstorm: paws: ip blocklist needs to tcp reject if in 
tcp mode or change mode [puppet] - 10https://gerrit.wikimedia.org/r/671190 (https://phabricator.wikimedia.org/T276615) [16:35:23] (03PS14) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [16:35:46] (03PS2) 10Bstorm: paws: ip blocklist needs to tcp reject if in tcp mode or change mode [puppet] - 10https://gerrit.wikimedia.org/r/671190 (https://phabricator.wikimedia.org/T276615) [16:36:47] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [16:38:02] (03PS3) 10Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [16:38:50] (03CR) 10Ottomata: "Hm, I think this is ok, but I'm not sure it is really the right thing to do." [puppet] - 10https://gerrit.wikimedia.org/r/671172 (owner: 10Elukey) [16:38:53] (03CR) 10Bstorm: "Since this terminates SSL for the ingress, it should be fine to use http mode here, I think. We inherited the tcp mode stuff from using ha" [puppet] - 10https://gerrit.wikimedia.org/r/671190 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [16:42:06] (03CR) 10ArielGlenn: [C: 03+1] "Heh, good catch." 
[puppet] - 10https://gerrit.wikimedia.org/r/671178 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [16:42:25] (03CR) 10Ottomata: [C: 03+1] profile::hadoop::common: move defaults from hiera to the profile [puppet] - 10https://gerrit.wikimedia.org/r/671172 (owner: 10Elukey) [16:44:12] (03PS3) 10Bstorm: paws: ip blocklist needs to tcp reject if in tcp mode or change mode [puppet] - 10https://gerrit.wikimedia.org/r/671190 (https://phabricator.wikimedia.org/T276615) [16:46:06] (03CR) 10Bstorm: paws: ip blocklist needs to tcp reject if in tcp mode or change mode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/671190 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [16:46:16] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:49:43] (03PS15) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [16:59:47] (03CR) 10Elukey: "Hugh I think it is the time to create a new role, it should be way smoother, what do you think? 
Something like role::aqs_next" [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:00:16] RECOVERY - cassandra-a service on aqs1010 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:00:22] RECOVERY - cassandra-a CQL 10.64.0.88:9042 on aqs1010 is OK: TCP OK - 0.001 second response time on 10.64.0.88 port 9042 https://phabricator.wikimedia.org/T93886 [17:00:47] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host [17:00:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host [17:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:41] (03CR) 10Bstorm: "I've validated this to be at least harmless by live-hacking it into the standby proxy." 
[puppet] - 10https://gerrit.wikimedia.org/r/671190 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [17:02:08] (03CR) 10Jbond: P:base: add ability to manage services file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670918 (owner: 10Jbond) [17:02:31] (03CR) 10Bstorm: [C: 03+2] paws: ip blocklist needs to tcp reject if in tcp mode or change mode [puppet] - 10https://gerrit.wikimedia.org/r/671190 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [17:03:43] (03PS1) 10Jdlrobson: Revert "Fix client error logging" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671160 [17:03:48] 10SRE, 10ops-eqiad: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T277316 (10ops-monitoring-bot) [17:06:17] (03CR) 10Jbond: [C: 03+1] Add approval for swift-roots [puppet] - 10https://gerrit.wikimedia.org/r/671180 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [17:06:45] (03CR) 10Jbond: [C: 03+1] Add approval for graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/671182 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [17:07:36] (03CR) 10Jbond: [C: 03+1] Add approval for snapshot admins [puppet] - 10https://gerrit.wikimedia.org/r/671178 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [17:11:39] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/671187 (owner: 10Herron) [17:16:53] (03CR) 10LSobanski: [C: 03+1] "We don't really own Swift yet but I guess I'm as good as anyone at this point." 
[puppet] - 10https://gerrit.wikimedia.org/r/671180 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [17:19:24] (03CR) 10BryanDavis: "> I'm open to opinions on that, but I'm inclined to make another patchset that skips toolsdb so as not to increase usage of and reliance o" [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [17:24:28] (03CR) 10Jbond: "happy to merge this but will at least wait for a +1 from ZPapierski" [puppet] - 10https://gerrit.wikimedia.org/r/670943 (owner: 10Ebernhardson) [17:26:42] (03CR) 10Ottomata: [C: 03+1] "Talked with Luca in IRC this is good after all!" [puppet] - 10https://gerrit.wikimedia.org/r/671172 (owner: 10Elukey) [17:29:49] (03CR) 10Dzahn: "Thank you, Andrew!" [puppet] - 10https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:31:34] (03CR) 10Dzahn: [C: 03+1] Add approval for graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/671182 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [17:32:24] (03CR) 10Hnowlan: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:33:11] (03CR) 10Dzahn: [C: 03+2] gerrit: make voting scores a bit more accessible [puppet] - 10https://gerrit.wikimedia.org/r/671101 (https://phabricator.wikimedia.org/T256615) (owner: 10Hashar) [17:33:30] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 56549688 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:35:54] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:36:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 99 probes of 597 (alerts on 65) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:41:16] (03CR) 10Dzahn: k8s: Add docker-registry credentials to pull restricted images (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [17:41:19] (03PS1) 10Jdlrobson: Use master version of clientError.js [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671202 [17:42:20] (03CR) 10Mforns: "Hey Razzi, do you think this will work?" [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [17:43:26] (03CR) 10Thcipriani: [C: 03+1] "+1 insofar as it seems to match the diff: https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/refs/heads/wmf/1.36.0-wmf." [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671202 (owner: 10Jdlrobson) [17:43:43] (03Abandoned) 10Jdlrobson: Revert "Fix client error logging" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671160 (owner: 10Jdlrobson) [17:44:15] (03CR) 10Dzahn: "Fair enough, i'll pick another one that isn't a scap proxy. Just the "another row" part surprised me a bit. I would have tried to stay in " [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [17:48:08] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [17:48:20] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10RobH) [17:48:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 47 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:49:08] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10RobH) a:03Papaul [17:49:22] (03CR) 10Dzahn: "suggesting mw2299 instead of mw2300 now. it's not a scap proxy, it's new hardware from 2019 so doesn't have to move, but still same row as" [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [17:49:43] 10SRE, 10Kubernetes: helm test fails in ci namespace - https://phabricator.wikimedia.org/T277252 (10jeena) aaah 😵 of course! Thanks @JMeybohm ! 
[17:49:45] 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10RobH) [17:49:47] (03PS1) 10Reedy: Remove wgEnableRestAPI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671203 [17:50:19] (03PS4) 10Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [17:51:04] (03CR) 10jerkins-bot: [V: 04-1] aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:51:41] 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10RobH) [17:55:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:56:21] (03CR) 10DannyS712: "is the portals file meant to be updated as part of this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671203 (owner: 10Reedy) [17:57:30] (03PS2) 10Reedy: Remove wgEnableRestAPI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671203 [17:58:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:58:28] 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10RobH) [17:59:39] 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) [18:00:05] 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) [18:00:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2229.codfw.wmnet [18:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2230.codfw.wmnet [18:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2244.codfw.wmnet [18:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2245.codfw.wmnet [18:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:21] (03PS1) 10Mstyles: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [18:03:24] (03PS3) 10Dduvall: pipeline: add building the webserver image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669807 (owner: 10Giuseppe Lavagetto) [18:03:50] !log depooling mw2244,mw2245 (API on old hardware), mw2229,mw2230 (app on old hardware) - T277119 [18:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:58] T277119: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 [18:04:58] (03CR) 10DannyS712: 
[C: 03+1] Remove wgEnableRestAPI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671203 (owner: 10Reedy) [18:06:03] (03PS5) 10Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [18:06:24] (03CR) 10jerkins-bot: [V: 04-1] aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [18:06:51] (03PS2) 10Mstyles: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [18:10:27] (03PS2) 10Dzahn: mcrouter: replace proxy for codfw A3, mw2235->mw2299 [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) [18:10:37] (03PS6) 10Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [18:11:41] (03CR) 10jerkins-bot: [V: 04-1] aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [18:12:01] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28547/console" [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [18:21:13] !log dns[34]001.wikimedia.org - upgrade gdnsd to 3.6.0 (half the servers have been on this for a couple weeks now, just finishing up the rollout) [18:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:22] (03CR) 10Legoktm: [C: 04-1] site/conftool: turn mw2374 and mw2376 into API canaries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119) 
(owner: 10Dzahn) [18:24:07] (03PS27) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [18:24:09] (03PS1) 10Andrew Bogott: prepare_cinder_volume.py: Add optional arg for mount options [puppet] - 10https://gerrit.wikimedia.org/r/671208 (https://phabricator.wikimedia.org/T272114) [18:24:11] (03PS1) 10Andrew Bogott: prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) [18:24:13] (03PS1) 10Andrew Bogott: cinderutils::ensure: support specifying mount options and file mode [puppet] - 10https://gerrit.wikimedia.org/r/671210 (https://phabricator.wikimedia.org/T272114) [18:24:48] !log dns[15]001.wikimedia.org - upgrade gdnsd to 3.6.0 (half the servers have been on this for a couple weeks now, just finishing up the rollout) [18:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:59] (03CR) 10jerkins-bot: [V: 04-1] prepare_cinder_volume.py: Add optional arg for mount options [puppet] - 10https://gerrit.wikimedia.org/r/671208 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [18:25:06] (03CR) 10jerkins-bot: [V: 04-1] prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [18:27:28] (03PS2) 10Andrew Bogott: prepare_cinder_volume.py: Add optional arg for mount options [puppet] - 10https://gerrit.wikimedia.org/r/671208 (https://phabricator.wikimedia.org/T272114) [18:27:30] (03PS2) 10Andrew Bogott: prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) [18:27:32] (03PS2) 10Andrew Bogott: cinderutils::ensure: support specifying mount options and file mode [puppet] - 10https://gerrit.wikimedia.org/r/671210 
(https://phabricator.wikimedia.org/T272114) [18:27:34] (03PS28) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [18:28:14] (03CR) 10jerkins-bot: [V: 04-1] prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [18:28:32] !log authdns1001.wikimedia.org,dns2001.wikimedia.org - upgrade gdnsd to 3.6.0 (half the servers have been on this for a couple weeks now, just finishing up the rollout) [18:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:42] (03PS3) 10Andrew Bogott: prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) [18:29:44] (03PS3) 10Andrew Bogott: cinderutils::ensure: support specifying mount options and file mode [puppet] - 10https://gerrit.wikimedia.org/r/671210 (https://phabricator.wikimedia.org/T272114) [18:29:46] (03PS29) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [18:31:15] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Marostegui) Thanks Chris - I can access the host now. I will reimage it and populate it with data on Monday. [18:33:43] (03PS1) 10Dzahn: site/conftool: decom mw2217, mw2218, mw2219 [puppet] - 10https://gerrit.wikimedia.org/r/671212 (https://phabricator.wikimedia.org/T277119) [18:37:12] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T277316 (10Marostegui) 05Open→03Resolved a:03Marostegui The backplane of this host was replaced. The RAID is now in optimal status ` root@db1162:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Informatio... 
[18:42:11] (03CR) 10Dduvall: [C: 03+1] "I ran the multiversion build against this patch and it completed successfully. The published image is at docker-registry.wikimedia.org/wik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669807 (owner: 10Giuseppe Lavagetto) [18:48:02] (03CR) 10Dzahn: site/conftool: turn mw2374 and mw2376 into API canaries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [18:52:20] (03PS4) 10Dzahn: site/conftool: turn mw2374 and mw2376 into API canaries [puppet] - 10https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119) [19:00:48] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 14.19 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:11:36] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (10Ladsgroup) > (which I think we should disable because the VPS proxy enforces HTTPS so only internal traffic can hit it on HTTP). AFAIK this is not behind the VPS proxy. The DNS is bound to... 
[19:12:13] (03PS5) 10Dzahn: site/conftool: turn mw2374 and mw2376 into API canaries [puppet] - 10https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119) [19:15:36] (03CR) 10Legoktm: [C: 03+1] site/conftool: turn mw2374 and mw2376 into API canaries [puppet] - 10https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [19:22:29] (03CR) 10Herron: pontoon: add hiera settings for o11y-grafana (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/671187 (owner: 10Herron) [19:32:17] (03PS1) 10BBlack: split ncredir for wm.com, set ncache ttl to 600 [dns] - 10https://gerrit.wikimedia.org/r/671224 [19:35:20] (03CR) 10BBlack: [C: 03+2] split ncredir for wm.com, set ncache ttl to 600 [dns] - 10https://gerrit.wikimedia.org/r/671224 (owner: 10BBlack) [19:40:44] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28548/mw2374.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/670992 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [19:41:08] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2374.codfw.wmnet [19:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:14] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2376.codfw.wmnet [19:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:30] !log mw2374, mw2376 - depooling to turn them into canaries [19:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2374.codfw.wmnet [19:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:04] !log start in-place reindex testwiki in eqiad, codfw, cloudelastic cirrus clusters for T269493 [19:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:11] T269493: Add 
hasrecommendation: search keyword - https://phabricator.wikimedia.org/T269493 [19:47:44] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2374.codfw.wmnet,service=canary [19:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:50] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2376.codfw.wmnet,service=canary [19:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:09] (03PS2) 10Dzahn: site/conftool: decom mw2217, mw2218, mw2219 [puppet] - 10https://gerrit.wikimedia.org/r/671212 (https://phabricator.wikimedia.org/T277119) [19:57:05] (03PS16) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [19:58:11] 10SRE, 10netops, 10Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (10jbond) >>! In T277146#6908177, @jbond wrote: >> For now (and as i just realised there are only 4) i think ill do the last option ~~implemented~~ apple talk dropped [20:04:34] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) Is this still happening? I noticed a task that was commented out in my crontab file, was... [20:06:50] (03CR) 10Effie Mouzeli: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [20:13:05] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10Legoktm) >>! In T269914#6909641, @Cyberpower678 wrote: > Is this still happening? I noticed a task that... 
[20:14:23] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (10RobH) [20:14:29] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2217.codfw.wmnet [20:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:36] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (10RobH) [20:14:39] jouncebot: now [20:14:39] For the next 11 hour(s) and 45 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210312T0800) [20:14:53] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2218.codfw.wmnet [20:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:59] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2219.codfw.wmnet [20:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2217.codfw.wmnet [20:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:08] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10RobH) [20:16:21] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10RobH) [20:17:52] (03PS1) 10Eevans: Update sessionstore staging to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/671249 (https://phabricator.wikimedia.org/T274262) [20:19:23] (03CR) 10Eevans: [V: 03+2 C: 03+2] Update sessionstore staging to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/671249 (https://phabricator.wikimedia.org/T274262) (owner: 
10Eevans) [20:22:04] !log eevans@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [20:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:04] 10SRE, 10Kubernetes: helm test fails in ci namespace - https://phabricator.wikimedia.org/T277252 (10jeena) In case anyone one else comes across this, the network policy doesn't do anything on minikube unless you start minikube with `--network-plugin=cni --cni=the cni you will use` and then deploy the cni to mi... [20:26:03] 10ops-codfw, 10DC-Ops: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 (10RobH) [20:26:13] 10ops-codfw, 10DC-Ops: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 (10RobH) [20:32:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2217.codfw.wmnet [20:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2218.codfw.wmnet [20:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:48] PROBLEM - mediawiki-installation DSH group on mw2219 is CRITICAL: Host mw2219 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:35:48] (03PS1) 10Eevans: Update sessionstore prod to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/671252 (https://phabricator.wikimedia.org/T274262) [20:37:13] (03CR) 10Eevans: [C: 04-1] "Waiting for a day other than Friday to deploy to production." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/671252 (https://phabricator.wikimedia.org/T274262) (owner: 10Eevans) [20:46:03] (03PS1) 10Hashar: Revert "gerrit: make voting scores a bit more accessible" [puppet] - 10https://gerrit.wikimedia.org/r/671228 (https://phabricator.wikimedia.org/T256615) [20:47:46] (03CR) 10Dzahn: [C: 03+2] Revert "gerrit: make voting scores a bit more accessible" [puppet] - 10https://gerrit.wikimedia.org/r/671228 (https://phabricator.wikimedia.org/T256615) (owner: 10Hashar) [20:47:59] mutante: yeah my fault. I did not test it ;D [20:48:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2218.codfw.wmnet [20:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:51] hashar: not a problem, it was a thing that was easiest to simply merge and see [20:49:17] that GerritSite.css looks mostly obsolete. It is apparently only used for the login page nowadays [20:49:26] which has led me to confusion [20:49:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2219.codfw.wmnet [20:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:44] I will have a look at how we can override css, but it seems it is no longer possible :\ [20:49:44] yea, I think I remember Paladox working on login page design [20:50:33] oh [20:50:43] yes you cannot use that file to override css in PolyGerrit [20:50:48] you have to use gerrit-theme [20:51:23] hashar mutante ^ [20:51:29] reverted on gerrit1001 now [20:51:43] ah our Chief Gerrit Officer is around! :] [20:52:21] so I just wanted to .vote { font-weight: bold; } [20:52:40] and apparently with Gerrit 3.1 / Polymer 3 we can not override things anymore [20:53:02] that has to be a css variable such as --vote-chip-font-weight (which does not exist) [20:53:42] maybe there is a way in javascript to lookup the component that holds the voting chip and then adjust its css. 
But I haven't found any hints to do so [20:53:43] you add -- here https://github.com/GerritCodeReview/gerrit/blob/a070eb88d03dc50d81e7c4f338f90bded99583ae/polygerrit-ui/app/elements/shared/gr-label-info/gr-label-info_html.ts#L30 [20:53:50] or [20:53:54] you can use --vote-chip-styles i think [20:54:20] html/dom/css have changed so much :D [20:54:58] yeh [20:54:58] ah nice lead [20:55:16] and here is the dom-module "gr-voting-styles" [20:55:17] (03CR) 10Krinkle: "We've done this once or twice before. It might be worth checking how we did it then just in case there's something we missed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [20:58:11] (03CR) 10Krinkle: Support having multiple IRC feed servers (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [20:58:40] (03PS2) 10Krinkle: Define IRC feed servers as an array in {Production,Labs}Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670914 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [20:58:49] (03PS2) 10Krinkle: Remove back-compat from when IRC feed servers was a string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670915 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [20:59:07] (03CR) 10Krinkle: [C: 03+1] Remove back-compat from when IRC feed servers was a string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670915 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [20:59:13] (03CR) 10Krinkle: [C: 03+1] Define IRC feed servers as an array in {Production,Labs}Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670914 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [21:02:06] paladox: yeah that is a good lead. Though from our wm-common-style , I can't override the Gerrit builtin --vote-chip-styles :D [21:02:26] you should be able to. 
[21:03:03] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:05] --vote-chip-styles { css.... } [21:06:24] --vote-chip-styles { font-size: 1.1em; font-weight: bold; } [21:06:49] yeah [21:06:55] but I have no clue where to put that ;] [21:07:07] my few tries in our wm-common-styles does not seem to change anything [21:07:29] I guess cause there is a shadow dom [21:08:40] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10bd808) `lang=irc [20:33] < CP678> legoktm: Okay, my bad it wasn't commented it was starred.  (*)  I... [21:08:45] hashar you can put it in the html [21:08:54] html {} [21:12:42] hashar https://phabricator.wikimedia.org/P14821 [21:12:59] yeah I tried that [21:13:05] doesn't change anything :\ [21:14:14] oh [21:15:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2219.codfw.wmnet [21:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:04] oh [21:16:06] actually [21:16:31] (03CR) 10Dzahn: [C: 03+2] site/conftool: decom mw2217, mw2218, mw2219 [puppet] - 10https://gerrit.wikimedia.org/r/671212 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [21:17:18] hashar what about https://phabricator.wikimedia.org/P14822? 
[21:17:45] ah [21:26:07] (PS1) Dzahn: site/conftool: decom mw2220 through mw2223 [puppet] - https://gerrit.wikimedia.org/r/671259 (https://phabricator.wikimedia.org/T277119) [21:27:35] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:28:39] (PS2) Dzahn: site/conftool: decom mw2224 through mw2229 [puppet] - https://gerrit.wikimedia.org/r/670957 (https://phabricator.wikimedia.org/T277119) [21:28:46] hashar does adding : work? --vote-chip-styles: {...} per "Use custom CSS mixins" [21:28:55] https://polymer-library.polymer-project.org/3.0/docs/devguide/custom-css-properties [21:29:04] so if i do that [21:29:35] I do see at the top of the DOM under body > custom-style > style that it adds a bunch of --vote-chip-styles* properties [21:29:45] so it looks like it just applies --vote-chip-styles [21:29:53] but does not take into account what I list in {...} [21:30:15] SRE, InternetArchiveBot, Traffic, Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (Legoktm) a:Cyberpower678 [21:31:57] (PS1) Dzahn: site/conftool: decom mw2239 through mw2242 [puppet] - https://gerrit.wikimedia.org/r/671260 (https://phabricator.wikimedia.org/T277119) [21:32:16] paladox: I give up for tonight. I guess I should reach out to the mailing list next week ;) [21:32:41] ok :) [21:35:41] paladox: thank you :] [21:35:50] :) [21:37:11] good night everyone! [21:47:29] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (wiki_willy) Hi @ayounsi - when we budgeted these last year, I think it was for general expansion of 10g switches. Do you have any specific racks you want to put this in? (like W...
[21:50:26] (PS1) Ryan Kemper: wdqs: new query-preview cert for wdqs-test [puppet] - https://gerrit.wikimedia.org/r/671267 (https://phabricator.wikimedia.org/T266470) [21:51:31] (CR) Ryan Kemper: [C: +2] wdqs: new query-preview cert for wdqs-test [puppet] - https://gerrit.wikimedia.org/r/671267 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper) [21:52:08] !log puppetmaster1001 sudo puppet cert clean testreduce.discovery.wmnet (T266509) [21:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:17] (PS1) Ryan Kemper: Revert "Revert "wdqs: impl. envoy for wdqs-test"" [puppet] - https://gerrit.wikimedia.org/r/671229 [21:52:17] T266509: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 [21:52:56] (CR) Ryan Kemper: [C: +2] Revert "Revert "wdqs: impl. envoy for wdqs-test"" [puppet] - https://gerrit.wikimedia.org/r/671229 (owner: Ryan Kemper) [21:56:44] (PS1) Ryan Kemper: wdqs: new query-preview cert for wdqs-test [puppet] - https://gerrit.wikimedia.org/r/671273 (https://phabricator.wikimedia.org/T266470) [21:56:46] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (Legoktm) If we do: `lang=diff ALLOWED_HOSTS = [ "localhost", # Archiving API from Mailman, keep it. 'lists.wmcloud.org', + 'mailman-mailman02.mailman.eqiad.wmflabs', '0.... [21:57:38] (CR) Ryan Kemper: [C: +2] wdqs: new query-preview cert for wdqs-test [puppet] - https://gerrit.wikimedia.org/r/671273 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper) [21:58:16] SRE, netops, Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (jbond) just a note to self, wondered if it my be better to index names by port number but we have issues with the following conflicts. i think this works with the current patches however the netbase::...
[21:58:47] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:59:15] RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.001 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [21:59:20] (PS1) Dzahn: ssl: add regenerated TLS cert for testreduce with new SAN [puppet] - https://gerrit.wikimedia.org/r/671275 (https://phabricator.wikimedia.org/T266509) [22:00:23] (CR) Dzahn: [C: +2] "openssl x509 -in testreduce.discovery.wmnet.crt -text -noout | grep DNS" [puppet] - https://gerrit.wikimedia.org/r/671275 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn) [22:02:43] (PS1) Legoktm: mailman3: Don't talk to hyperkitty over localhost [puppet] - https://gerrit.wikimedia.org/r/671279 (https://phabricator.wikimedia.org/T256536) [22:03:22] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) [22:04:03] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) @ssastry Done! https://parsoid-rt-tests.wikimedia.org/ has been reactivated. It needed the parsoid-rt-tests.w...
[22:05:45] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) [22:06:34] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) Open→Resolved {F34156177} [22:10:20] !log imported mailman-puppetmaster.mailman.eqiad1.wikimedia.cloud facts to puppet-compiler [22:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:18:15] (CR) Ladsgroup: [C: +1] mailman3: Don't talk to hyperkitty over localhost [puppet] - https://gerrit.wikimedia.org/r/671279 (https://phabricator.wikimedia.org/T256536) (owner: Legoktm) [22:20:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:28:43] (PS1) Herron: wip: initial grafana::grizzly module and profile [puppet] - https://gerrit.wikimedia.org/r/671283 [22:30:37] (PS1) Bstorm: paws: block using the Jupyterhub from Tor [puppet] - https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) [22:31:09] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:33:06] (CR) Bstorm: "Since the basic mechanism works, this aims to expand it with a dynamically generated list of Tor exit nodes. I always hate blocking tor, b" [puppet] - https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: Bstorm) [22:34:20] (CR) Bstorm: paws: block using the Jupyterhub from Tor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: Bstorm) [22:37:55] (PS1) Cwhite: logstash: add normalize level filter to scap migration [puppet] - https://gerrit.wikimedia.org/r/671290 [22:39:30] (CR) jerkins-bot: [V: -1] logstash: add normalize level filter to scap migration [puppet] - https://gerrit.wikimedia.org/r/671290 (owner: Cwhite) [22:42:06] (PS2) Cwhite: logstash: add normalize level filter to scap migration [puppet] - https://gerrit.wikimedia.org/r/671290 [22:42:43] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:50:44] ACKNOWLEDGEMENT - Check no envoy runtime configuration is left persistent on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused Ryan
Kemper Related to https://phabricator.wikimedia.org/T266470, will revert soon if root cause isnt found quickly https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:50:44] ACKNOWLEDGEMENT - Check no envoy runtime configuration is left persistent on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused Ryan Kemper Related to https://phabricator.wikimedia.org/T266470, will revert soon if root cause isnt found quickly https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:53:50] !log T266470 Manually disabled service notifications for `Check no envoy runtime configuration is left persistent`, will need to circle back on Monday to restore notifications [22:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:57] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470 [23:03:13] (PS1) Legoktm: Add mailman VPS secrets [labs/private] - https://gerrit.wikimedia.org/r/671298 [23:03:42] (PS2) Legoktm: mailman3: Don't talk to hyperkitty over localhost [puppet] - https://gerrit.wikimedia.org/r/671279 (https://phabricator.wikimedia.org/T256536) [23:03:44] (PS1) Legoktm: Add hiera for mailman Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/671299 [23:04:39] (CR) Legoktm: [V: +2 C: +2] Add mailman VPS secrets [labs/private] - https://gerrit.wikimedia.org/r/671298 (owner: Legoktm) [23:05:55] (CR) Legoktm: [C: +2] Add hiera for mailman Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/671299 (owner: Legoktm) [23:06:03] SRE, serviceops, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki) I run a test on mwdebug1001 where I switched on mcrouter its onhost memcached from plain to ssl: ` "onhost": { "servers": [...
[23:06:31] (PS17) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [23:07:16] (CR) jerkins-bot: [V: -1] netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond) [23:14:07] (PS3) Cwhite: logstash: add normalize level filter to scap migration [puppet] - https://gerrit.wikimedia.org/r/671290 [23:14:18] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28551/console" [puppet] - https://gerrit.wikimedia.org/r/671279 (https://phabricator.wikimedia.org/T256536) (owner: Legoktm) [23:16:34] (CR) Legoktm: [V: +1 C: +2] mailman3: Don't talk to hyperkitty over localhost [puppet] - https://gerrit.wikimedia.org/r/671279 (https://phabricator.wikimedia.org/T256536) (owner: Legoktm) [23:17:31] (PS18) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)