[00:16:03] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 958.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:30:03] (PS2) Bstorm: host monitoring: add optional contact group for mgmt interfaces [puppet] - https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458)
[00:35:27] (CR) CRusnov: [C: +1] "This LGTM, would like additional sign-off." [puppet] - https://gerrit.wikimedia.org/r/543252 (https://phabricator.wikimedia.org/T235458) (owner: Brian Wolff)
[00:37:51] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (Bstorm) T229156 according to that ticket, this is the disk that came from Dell for that ticket.
[00:45:39] (PS1) CRusnov: mailman: add alias and redirect for multimedia-team [puppet] - https://gerrit.wikimedia.org/r/545122 (https://phabricator.wikimedia.org/T235550)
[00:47:12] (CR) CRusnov: "We shall need this merged shortly after the rename is executed." [puppet] - https://gerrit.wikimedia.org/r/545122 (https://phabricator.wikimedia.org/T235550) (owner: CRusnov)
[00:51:29] Hmm gerrit-replication seems down?
[00:51:30] Oh!
[00:51:32] Misplet
[00:54:22] (PS1) CRusnov: netbox: Enable CSV dump rotations. [puppet] - https://gerrit.wikimedia.org/r/545123
[01:02:40] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (Bstorm) There are two disks in there with the larger size. {F30874039} Note that the size matches this: T229156#5399581 -- which is this disk replaced in the last ticket. This suggests the disk was repla...
[01:09:36] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (Bstorm) Service request 986376069 does not show anything terribly useful. Since T229156 shows the disk at its current size, I have to imagine that Dell sent us larger disks during that request. I thought t...
[01:11:23] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (Bstorm) @wiki_willy I think we need to follow up with Dell about that. They should have some kind of tracking on the disk serial numbers, etc. that they have been sending us, right?
[01:50:37] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:01:22] (PS3) Huji: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 4nn1l2)
[02:01:41] (CR) Huji: [C: +1] "This can be merged now." [mediawiki-config] - https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 4nn1l2)
[02:02:05] (CR) jerkins-bot: [V: -1] Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 4nn1l2)
[02:13:37] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 45 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:15:40] (CR) DannyS712: "Error: /src/wmf-config/InitialiseSettings.php should not be executable" [mediawiki-config] - https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 4nn1l2)
[02:19:11] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 25 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[03:42:50] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3047 [puppet] - https://gerrit.wikimedia.org/r/545127 (https://phabricator.wikimedia.org/T231433)
[03:42:52] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3047 [puppet] - https://gerrit.wikimedia.org/r/545128 (https://phabricator.wikimedia.org/T231433)
[03:43:59] !log Switch from nginx to ats-tls on cp3047 - T231433
[03:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:44:04] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[03:44:33] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp3047 [puppet] - https://gerrit.wikimedia.org/r/545127 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[03:46:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:46:24] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp3047 [puppet] - https://gerrit.wikimedia.org/r/545128 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[03:49:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:52:48] Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[04:11:34] (PS1) Vgutierrez: hiera: Set nginx on port 4443 for cache upload on esams [puppet] - https://gerrit.wikimedia.org/r/545129 (https://phabricator.wikimedia.org/T231433)
[04:11:36] (PS1) Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on esams [puppet] - https://gerrit.wikimedia.org/r/545130 (https://phabricator.wikimedia.org/T231433)
[04:18:48] !log Switch from nginx to ats-tls on cp3049 - T231433
[04:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:53] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:19:08] (CR) Vgutierrez: [C: +2] "pcc shows a NOOP for the whole cluster and the expected changes on cp3049: https://puppet-compiler.wmflabs.org/compiler1001/18969/" [puppet] - https://gerrit.wikimedia.org/r/545129 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:21:13] (CR) Vgutierrez: [C: +2] "pcc shows a NOOP for the whole cluster and the expected changes for cp3049: https://puppet-compiler.wmflabs.org/compiler1002/18970/" [puppet] - https://gerrit.wikimedia.org/r/545130 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:27:16] Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[04:30:12] (PS1) CRusnov: coherence: Check unracked devices for connected console ports [software/netbox-reports] - https://gerrit.wikimedia.org/r/545132
[04:30:16] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2024 [puppet] - https://gerrit.wikimedia.org/r/545133 (https://phabricator.wikimedia.org/T231433)
[04:30:18] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2024 [puppet] - https://gerrit.wikimedia.org/r/545134 (https://phabricator.wikimedia.org/T231433)
[04:30:48] (CR) jerkins-bot: [V: -1] coherence: Check unracked devices for connected console ports [software/netbox-reports] - https://gerrit.wikimedia.org/r/545132 (owner: CRusnov)
[04:30:58] !log Switch from nginx to ats-tls on cp2024 - T231433
[04:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:02] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:31:30] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp2024 [puppet] - https://gerrit.wikimedia.org/r/545133 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:32:46] Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (crusnov) This should be fixed now.
[04:35:10] (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2024 [puppet] - https://gerrit.wikimedia.org/r/545134 (https://phabricator.wikimedia.org/T231433)
[04:35:12] (PS1) Vgutierrez: hiera: Move cp2024.yaml to the proper directory [puppet] - https://gerrit.wikimedia.org/r/545136 (https://phabricator.wikimedia.org/T231433)
[04:36:09] (CR) Vgutierrez: [C: +2] hiera: Move cp2024.yaml to the proper directory [puppet] - https://gerrit.wikimedia.org/r/545136 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:37:56] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp2024 [puppet] - https://gerrit.wikimedia.org/r/545134 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:43:26] Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[04:50:19] (PS1) Vgutierrez: hiera: Set nginx on port 4443 for cache upload on codfw [puppet] - https://gerrit.wikimedia.org/r/545138 (https://phabricator.wikimedia.org/T231433)
[04:50:21] (PS1) Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on codfw [puppet] - https://gerrit.wikimedia.org/r/545139 (https://phabricator.wikimedia.org/T231433)
[04:58:06] !log Switch from nginx to ats-tls on cp2026 - T231433
[04:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:10] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:58:16] (CR) Vgutierrez: [C: +2] "pcc shows a NOOP for the whole cluster and the expected changes on cp2026: https://puppet-compiler.wmflabs.org/compiler1002/18971/" [puppet] - https://gerrit.wikimedia.org/r/545138 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:59:59] (CR) Vgutierrez: "pcc shows a NOOP for the whole cluster and the expected changes on cp2026: https://puppet-compiler.wmflabs.org/compiler1002/18972/" [puppet] - https://gerrit.wikimedia.org/r/545139 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:00:10] (CR) Vgutierrez: [C: +2] hiera: Set ats-tls on port 443 for cache upload nodes on codfw [puppet] - https://gerrit.wikimedia.org/r/545139 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:00:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2084:3314 after compression', diff saved to https://phabricator.wikimedia.org/P9420 and previous config saved to /var/cache/conftool/dbconfig/20191022-050048-marostegui.json
[05:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2089:3315 for compression T235599', diff saved to https://phabricator.wikimedia.org/P9421 and previous config saved to /var/cache/conftool/dbconfig/20191022-050204-marostegui.json
[05:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:09] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599
[05:05:14] Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:06:28] (PS1) Marostegui: report_users: Remove dbproxy1004,dbproxy1009 [software] - https://gerrit.wikimedia.org/r/545142 (https://phabricator.wikimedia.org/T231280)
[05:07:57] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp1088 [puppet] - https://gerrit.wikimedia.org/r/545143 (https://phabricator.wikimedia.org/T231433)
[05:07:59] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1088 [puppet] - https://gerrit.wikimedia.org/r/545144 (https://phabricator.wikimedia.org/T231433)
[05:08:01] !log Switch from nginx to ats-tls on cp1088 - T231433
[05:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:05] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[05:08:34] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp1088 [puppet] - https://gerrit.wikimedia.org/r/545143 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:09:10] (CR) Marostegui: [C: +2] report_users: Remove dbproxy1004,dbproxy1009 [software] - https://gerrit.wikimedia.org/r/545142 (https://phabricator.wikimedia.org/T231280) (owner: Marostegui)
[05:09:34] (Merged) jenkins-bot: report_users: Remove dbproxy1004,dbproxy1009 [software] - https://gerrit.wikimedia.org/r/545142 (https://phabricator.wikimedia.org/T231280) (owner: Marostegui)
[05:10:02] (PS4) Marostegui: db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142)
[05:10:10] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp1088 [puppet] - https://gerrit.wikimedia.org/r/545144 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:14:50] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp2025 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:14:56] Operations, ops-codfw, DC-Ops, decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (Marostegui)
[05:15:04] PROBLEM - HTTPS Unified RSA on cp2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:15:20] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2025 is CRITICAL: connect to address 10.192.48.29 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:15:36] PROBLEM - Check systemd state on cp2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:52] (PS1) Marostegui: wmnet: Remove db2060 DNS production entries [dns] - https://gerrit.wikimedia.org/r/545150 (https://phabricator.wikimedia.org/T231625)
[05:16:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:16:38] Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:43] (PS1) Marostegui: site.pp: Remove puppet references for db2060 [puppet] - https://gerrit.wikimedia.org/r/545151 (https://phabricator.wikimedia.org/T231625)
[05:16:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:54] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2060.codfw.wmnet` - db2060.codfw.wmnet (**PASS**)...
[05:17:28] (CR) Marostegui: [C: +2] site.pp: Remove puppet references for db2060 [puppet] - https://gerrit.wikimedia.org/r/545151 (https://phabricator.wikimedia.org/T231625) (owner: Marostegui)
[05:17:49] (CR) Marostegui: [C: +2] wmnet: Remove db2060 DNS production entries [dns] - https://gerrit.wikimedia.org/r/545150 (https://phabricator.wikimedia.org/T231625) (owner: Marostegui)
[05:18:12] Operations, ops-codfw, DC-Ops, decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (Marostegui) a:RobH→Papaul
[05:18:27] Operations, ops-codfw, DC-Ops, decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (Marostegui) Host ready for on-site and switch disablement steps
[05:18:50] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:19:12] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:19:23] hmm that's not expected
[05:19:25] * vgutierrez checking
[05:20:15] Operations, ops-eqiad, DC-Ops, Patch-For-Review: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui)
[05:20:17] wonderful.. I didn't have cp2025 listed on T231433
[05:20:18] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[05:20:42] !log depooling cp2025 to fix ATS/nginx configuration - T231433
[05:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:15] (PS1) Gergő Tisza: Set GrowthExperiments task suggester config on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/545155 (https://phabricator.wikimedia.org/T234426)
[05:22:04] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp2025 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345584 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:22:22] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp2025 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:22:30] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp2025 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345558 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:22:40] RECOVERY - HTTPS Unified RSA on cp2025 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345548 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:22:58] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2025 is OK: HTTP OK: HTTP/1.0 200 OK - 19521 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:23:50] RECOVERY - Check systemd state on cp2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:27] !log repooling cp2025 - T231433
[05:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:59] Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:26:37] Operations, Performance-Team, Wikimedia-General-or-Unknown, serviceops: Investigate recurrent latency spikes for the MediaWiki appservers - https://phabricator.wikimedia.org/T235872 (jijiki)
[05:27:10] (PS1) Vgutierrez: hiera: Set nginx on port 4443 for cache upload on eqiad [puppet] - https://gerrit.wikimedia.org/r/545156 (https://phabricator.wikimedia.org/T231433)
[05:27:12] (PS1) Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on eqiad [puppet] - https://gerrit.wikimedia.org/r/545157 (https://phabricator.wikimedia.org/T231433)
[05:28:51] (CR) Giuseppe Lavagetto: Parsoid/PHP: Load the extension on all Parsoid nodes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: Mobrovac)
[05:28:56] Operations, ops-esams, DC-Ops: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (Papaul)
[05:31:13] (CR) Giuseppe Lavagetto: [C: -1] "As I already suggested previously, this is the wrong approach." [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[05:32:02] Operations, ops-esams, DC-Ops: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (ayounsi)
[05:32:05] (PS2) Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on eqiad [puppet] - https://gerrit.wikimedia.org/r/545157 (https://phabricator.wikimedia.org/T231433)
[05:32:21] (PS1) Marostegui: mariadb: Set db1070 to spare [puppet] - https://gerrit.wikimedia.org/r/545158 (https://phabricator.wikimedia.org/T235464)
[05:32:58] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1070 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/545159 (https://phabricator.wikimedia.org/T235464)
[05:33:47] !log Switch from nginx to ats-tls on cp1090 - T231433
[05:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:51] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[05:33:55] (CR) Vgutierrez: [C: +2] "pcc shows a NOOP on the whole cluster and the expected changes on cp1090: https://puppet-compiler.wmflabs.org/compiler1002/18973/" [puppet] - https://gerrit.wikimedia.org/r/545156 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:34:56] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db1070 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/545159 (https://phabricator.wikimedia.org/T235464) (owner: Marostegui)
[05:35:40] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1070 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/545159 (https://phabricator.wikimedia.org/T235464) (owner: Marostegui)
[05:35:51] (CR) Vgutierrez: [C: +2] "pcc shows a generalized NOOP and the expected changes on cp1076 and cp1090: https://puppet-compiler.wmflabs.org/compiler1001/18974/" [puppet] - https://gerrit.wikimedia.org/r/545157 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:39:01] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1070 from config T235464 (duration: 00m 53s)
[05:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:06] T235464: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464
[05:40:01] !log Remove db1070 from tendril and zarcillo - T235464
[05:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1070 from config T235464 (duration: 00m 51s)
[05:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:55] !log Stop mysql on db1070 - T235464
[05:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:43] Operations, DBA: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (Marostegui)
[05:43:30] (CR) Marostegui: [C: +2] mariadb: Set db1070 to spare [puppet] - https://gerrit.wikimedia.org/r/545158 (https://phabricator.wikimedia.org/T235464) (owner: Marostegui)
[05:47:42] (PS1) Vgutierrez: hiera: Unify common ats-tls settings for cache upload [puppet] - https://gerrit.wikimedia.org/r/545162 (https://phabricator.wikimedia.org/T231433)
[05:47:44] Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:47:47] (CR) Giuseppe Lavagetto: [C: +1] spec: remove hhvm references from tests [puppet] - https://gerrit.wikimedia.org/r/544847 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[05:48:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1070 from config T235464', diff saved to https://phabricator.wikimedia.org/P9422 and previous config saved to /var/cache/conftool/dbconfig/20191022-054759-marostegui.json
[05:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:04] T235464: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464
[05:51:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096 for PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9423 and previous config saved to /var/cache/conftool/dbconfig/20191022-055151-marostegui.json
[05:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:59] (CR) Vgutierrez: [C: +2] "pcc shows a NOOP across cache upload nodes on every DC: https://puppet-compiler.wmflabs.org/compiler1001/18975/" [puppet] - https://gerrit.wikimedia.org/r/545162 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:54:39] Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez) Open→Resolved
[05:54:44] Operations, Traffic, Goal, Patch-For-Review, Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Vgutierrez)
[05:54:48] Operations, Traffic: Get rid of nginx puppetization for cache upload - https://phabricator.wikimedia.org/T236120 (Vgutierrez)
[05:57:15] (PS5) Marostegui: db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142)
[05:57:21] (CR) Marostegui: db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142) (owner: Marostegui)
[05:57:30] (Abandoned) Vgutierrez: Testing buffer_upload experimental plugin - do not merge [debs/trafficserver] - https://gerrit.wikimedia.org/r/543271 (https://phabricator.wikimedia.org/T234887) (owner: Vgutierrez)
[06:02:44] (CR) Vgutierrez: [C: +2] ATS: Enable reloading global lua script [puppet] - https://gerrit.wikimedia.org/r/543022 (https://phabricator.wikimedia.org/T233274) (owner: Vgutierrez)
[06:07:24] (CR) Vgutierrez: [C: +2] ATS: Use a common base path for /etc/ssl and /etc/acmecerts certs [puppet] - https://gerrit.wikimedia.org/r/544151 (https://phabricator.wikimedia.org/T234803) (owner: Vgutierrez)
[06:32:00] !log rolling restart of ats-tls - T233274 T234803
[06:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:05] T234803: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803
[06:32:06] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274
[06:41:11] (CR) Marostegui: [C: +2] db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142) (owner: Marostegui)
[06:41:50] (Merged) jenkins-bot: db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142) (owner: Marostegui)
[06:43:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1010 T227142 (duration: 00m 52s)
[06:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:14] T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142
[06:47:28] (CR) Kosta Harlan: [C: -1] Set GrowthExperiments task suggester config on beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/545155 (https://phabricator.wikimedia.org/T234426) (owner: Gergő Tisza)
[06:51:18] (PS1) Giuseppe Lavagetto: Add parsoid-php to the discovery records to switchover [cookbooks] - https://gerrit.wikimedia.org/r/545167
[06:53:21] (CR) Giuseppe Lavagetto: [C: +1] "I think btw it should be possible to use debdeploy and debmonitor for eevans as well. LGTM" [puppet] - https://gerrit.wikimedia.org/r/544966 (https://phabricator.wikimedia.org/T200803) (owner: Alexandros Kosiaris)
[06:53:38] (PS2) Gergő Tisza: Set GrowthExperiments task suggester config on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/545155 (https://phabricator.wikimedia.org/T234426)
[06:55:37] (CR) Giuseppe Lavagetto: "I think the point set forward by eevans makes sense. Unless we have a way to manually trigger the process, we might be better off writing " [puppet] - https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: Alexandros Kosiaris)
[06:57:01] (CR) Kosta Harlan: [C: +1] Set GrowthExperiments task suggester config on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/545155 (https://phabricator.wikimedia.org/T234426) (owner: Gergő Tisza)
[07:02:45] (CR) ArielGlenn: "I looked at removal scripts for the relevnt packages and it looks ok. We could do a test on a snapshot host as soon as one becomes idle, w" [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[07:03:55] (CR) Effie Mouzeli: "> As I already suggested previously, this is the wrong approach." [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[07:04:51] (CR) Effie Mouzeli: "> I looked at removal scripts for the relevnt packages and it looks" [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[07:14:11] (PS2) Muehlenhoff: Update microcode check [puppet] - https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250)
[07:17:59] !log installing tcpdump security updates
[07:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:42] (CR) Elukey: "Does it need to be on stat1007? In theory a cleaner solution, in my opinion, would be a Ganeti VM in the analytics VLAN dedicated to this " [puppet] - https://gerrit.wikimedia.org/r/544989 (owner: EBernhardson)
[07:36:42] (PS2) DCausse: Bump experimental-highlighter to 5.6.4.1 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/543188 (https://phabricator.wikimedia.org/T236123)
[07:37:04] Operations, Performance-Team, Traffic: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (Gilles) Confirmed on WMCS: ` HTTP/2 502 date: Tue, 22 Oct 2019 07:34:28 GMT content-type: text/html server: ATS/8.0.5 cache-control: no-store c...
[07:38:18] (PS4) Elukey: swap: Redirect stderr to /dev/null to prevent cronspam [puppet] - https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: Jcrespo)
[07:40:14] (CR) Elukey: [C: +2] "I keep postponing this due to other tasks and time off work, so let's merge this and stop cronspam, I'll come back to it :)" [puppet] - https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: Jcrespo)
[07:40:28] Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (Volans) >>! In T234452#5593612, @crusnov wrote: > This should be fixed now. Are you sure? Puppet is still broken on all of them AFAICT (I just checked randomly some of them). This is on the puppetmaster: ` T...
[07:41:06] Operations, Performance-Team, Traffic: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (ema) The certificate for performance.discovery.wmnet does not include performance.wikimedia.org in SubjectAltName, hence ATS fails to connect to...
[07:48:26] (03PS1) 10Ema: ssl: re-issue cert for performance.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/545203 (https://phabricator.wikimedia.org/T210411) [07:48:51] (03PS4) 10Urbanecm: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [07:49:39] (03CR) 10jerkins-bot: [V: 04-1] Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [07:50:20] (03CR) 10Ema: [C: 03+2] ssl: re-issue cert for performance.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/545203 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [07:53:49] !log Stop MySQL on db1116 pc1007 db1096:3315, db1096:3316 for PDU maintenance T227142 [07:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:53] T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 [07:54:20] (03PS5) 10Urbanecm: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [07:56:32] 10Operations, 10Traffic: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [07:56:56] (03CR) 10Giuseppe Lavagetto: LVS: add config for parsoid-php service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [07:58:06] PROBLEM - MariaDB Slave IO: pc1 on pc2007 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1007.eqiad.wmnet:3306 - retry-time: 
60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1007.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:58:26] PROBLEM - MariaDB Slave IO: pc1 on pc2010 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1007.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1007.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:58:31] ^ me [07:58:35] I will silence those [08:03:48] 10Operations, 10Traffic: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (10ema) [08:04:00] 10Operations, 10Traffic: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (10ema) p:05Triage→03Normal [08:04:35] (03PS1) 10Vgutierrez: acme_chief: Grant access to all cp nodes to the unified cert [puppet] - 10https://gerrit.wikimedia.org/r/545204 (https://phabricator.wikimedia.org/T234803) [08:05:40] !log Stop MySQL on labsdb1012 for PDU work T227142 [08:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:44] T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 [08:07:33] (03CR) 10Ema: [C: 03+1] acme_chief: Grant access to all cp nodes to the unified cert [puppet] - 10https://gerrit.wikimedia.org/r/545204 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [08:08:06] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Grant access to all cp nodes to the unified cert [puppet] - 10https://gerrit.wikimedia.org/r/545204 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [08:09:05] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - 
https://phabricator.wikimedia.org/T236102 (10ema) 05Open→03Resolved a:03ema Done, thanks for the bug report @ori! [08:09:42] (03Abandoned) 10Vgutierrez: package_builder: Fix debhelper dependencies on stretch [puppet] - 10https://gerrit.wikimedia.org/r/533896 (owner: 10Vgutierrez) [08:18:18] (03PS5) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [08:18:20] (03PS5) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) [08:18:22] (03PS4) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 [08:22:53] (03PS1) 10Vgutierrez: ATS: Reload TLS material on acme_chief::cert updates [puppet] - 10https://gerrit.wikimedia.org/r/545206 (https://phabricator.wikimedia.org/T234803) [08:23:37] (03PS1) 10Ema: kibana: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545207 (https://phabricator.wikimedia.org/T210411) [08:24:04] (03CR) 10Muehlenhoff: "Wrt the concern about picking an explicit version; we can also set this via ListShellHook (as already done for "elastic" and "elastic55")" [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [08:24:27] (03PS17) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [08:25:39] (03PS1) 10Vgutierrez: ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803) [08:28:25] (03CR) 10Volans: [C: 03+1] "LGTM python wise, I'll leave it to you for the flag logic" [puppet] - 10https://gerrit.wikimedia.org/r/544944 
(https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [08:29:35] (03PS1) 10Ema: Add kibana-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/545209 (https://phabricator.wikimedia.org/T210411) [08:30:23] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks, merging. The changes to the flag logic are based on my tests on servers on the fleet, there'll be a few tweaks for the blacklist, " [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [08:30:49] (03PS3) 10Muehlenhoff: Update microcode check [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) [08:32:39] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [08:33:30] (03CR) 10Muehlenhoff: [C: 03+2] Update microcode check [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [08:39:24] (03PS2) 10Vgutierrez: ATS,tlsproxy: Reload TLS material on acme_chief::cert updates [puppet] - 10https://gerrit.wikimedia.org/r/545206 (https://phabricator.wikimedia.org/T234803) [08:39:26] (03PS2) 10Vgutierrez: ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803) [08:40:22] (03PS18) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [08:45:38] (03CR) 10Jbond: [C: 03+1] "lgtm minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [08:45:46] (03PS3) 10Vgutierrez: ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803) [08:48:46] (03PS2) 10Ema: kibana: add TLS termination with envoy 
[puppet] - 10https://gerrit.wikimedia.org/r/545207 (https://phabricator.wikimedia.org/T210411) [08:48:55] (03CR) 10Volans: [C: 04-1] "Requires first a netbox deploy." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545123 (owner: 10CRusnov) [08:51:58] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: config readable by logstash only by default [puppet] - 10https://gerrit.wikimedia.org/r/544218 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [08:52:09] (03PS3) 10Ema: kibana: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545207 (https://phabricator.wikimedia.org/T210411) [08:52:11] (03PS2) 10Filippo Giunchedi: logstash: config readable by logstash only by default [puppet] - 10https://gerrit.wikimedia.org/r/544218 (https://phabricator.wikimedia.org/T235891) [08:58:25] (03CR) 10Ema: [C: 03+2] kibana: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545207 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:04:04] (03Abandoned) 10Vgutierrez: acme_chief cloud: Ensure that python3-designateclient is installed [puppet] - 10https://gerrit.wikimedia.org/r/528624 (owner: 10Vgutierrez) [09:05:23] (03PS3) 10Arturo Borrero Gonzalez: wikimedia.cloud: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/544175 (https://phabricator.wikimedia.org/T235846) [09:09:15] (03CR) 10Volans: "See comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545094 (owner: 10Dzahn) [09:10:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s4 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9424 and previous config saved to /var/cache/conftool/dbconfig/20191022-091051-marostegui.json [09:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:56] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [09:13:28] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s4 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9425 and previous config saved to /var/cache/conftool/dbconfig/20191022-091327-marostegui.json [09:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Minor comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545072 (https://phabricator.wikimedia.org/T235863) (owner: 10Jhedden) [09:16:04] 10Operations, 10Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10fgiunchedi) [09:17:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimedia.cloud: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/544175 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [09:20:07] (03CR) 10Jbond: [C: 03+1] "LGTM, minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544990 (owner: 10EBernhardson) [09:22:23] (03PS3) 10DCausse: Bump experimental-highlighter to 6.5.4.1 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/543188 (https://phabricator.wikimedia.org/T236123) [09:23:03] (03CR) 10Volans: "recheck" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/545132 (owner: 10CRusnov) [09:25:26] (03CR) 10Giuseppe Lavagetto: "> Wrt the concern about picking an explicit version; we can also set" [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [09:29:09] (03PS6) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [09:29:11] (03PS6) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) [09:29:13] (03PS5) 10Giuseppe Lavagetto: 
scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 [09:30:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:32:00] there was a spike [09:32:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:32:48] Error from line 130 of /srv/mediawiki/php-1.35.0-wmf.2/extensions/Graph/includes/ApiGraph.php: Call to a member function getExtensionData() on boolean [09:33:03] (exception) [09:33:34] ^heads up to releng to notify owners if it is not tracked already [09:34:29] oh, it is tracked already: https://phabricator.wikimedia.org/T235356 [09:41:14] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:42:50] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:45:32] (03PS2) 10Ema: Add kibana-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/545209 (https://phabricator.wikimedia.org/T210411) [09:48:55] (03PS3) 10Ema: Add kibana-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/545209 (https://phabricator.wikimedia.org/T210411) [09:49:31] (03CR) 10Vgutierrez: [C: 03+1] Add kibana-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/545209 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:51:02] (03CR) 10Ema: [C: 03+2] Add kibana-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/545209 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:53:01] lvs1016: restart pybal to add new service kibana-ssl T210411 [09:53:02] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [09:53:57] ema: I think you missed the log cmd [09:54:03] ah! [09:54:09] !log lvs1016: restart pybal to add new service kibana-ssl T210411 [09:54:11] <3 [09:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:14] thanks! 
[09:54:51] !log lvs2006: restart pybal to add new service kibana-ssl T210411 [09:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:37] (03CR) 10Vgutierrez: [C: 03+2] ATS,tlsproxy: Reload TLS material on acme_chief::cert updates [puppet] - 10https://gerrit.wikimedia.org/r/545206 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [09:56:45] (03CR) 10Vgutierrez: ATS,tlsproxy: Reload TLS material on acme_chief::cert updates [puppet] - 10https://gerrit.wikimedia.org/r/545206 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [09:57:14] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [09:57:27] uh [09:57:48] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.33:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:57:58] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [09:58:04] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.33:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:58:29] _joe_: can you see anything strange with /srv/config-master/pybal/eqiad/kibana on puppetmaster1001? [09:59:06] <_joe_> ema: /var/log/confd.log says anything? [09:59:41] <_joe_> [invalid]: server pool cannot be empty! [09:59:46] <_joe_> pool some servers :P [10:00:01] <_joe_> they're all in status pooled=inactive [10:00:03] new service.. everything is depooled :) [10:00:09] ah, right [10:00:17] the error is a bit misleading [10:00:19] <_joe_> nothig to worry about too much :P [10:00:22] <_joe_> what error? 
[10:00:39] the one saying that compilation is broken [10:00:51] it made me think of a syntax error in the template [10:00:55] <_joe_> it is true that the compilation is broken, it doesn't pass the verification step [10:01:11] <_joe_> well an error can be due to either the template or the data right? [10:02:11] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: service=kibana-ssl [10:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:22] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:03:40] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:04:41] _joe_: indeed. Thanks! [10:07:11] (03PS3) 10Vgutierrez: ATS,tlsproxy: Reload TLS material on acme_chief::cert updates [puppet] - 10https://gerrit.wikimedia.org/r/545206 (https://phabricator.wikimedia.org/T234803) [10:07:13] (03PS4) 10Vgutierrez: ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803) [10:10:01] is some manual intervention required now to make confd happy again on the puppetmaster? 
[10:11:10] (03CR) 10Hashar: [C: 03+1] "And from https://phabricator.wikimedia.org/T208566#5371633 that will let us do some cleanup after that, notably get rid of rake_modules/f" [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [10:12:46] (03CR) 10Vgutierrez: [C: 03+2] ATS,tlsproxy: Reload TLS material on acme_chief::cert updates [puppet] - 10https://gerrit.wikimedia.org/r/545206 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [10:14:49] !log puppetmaster1001: rm /var/run/confd-template/.kibana-ssl*.err to make confd icinga check happy T210411 [10:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:53] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [10:15:21] 10Operations, 10Traffic: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (10Joe) Given we have the hot-restarted now, that's probably a good idea. [10:15:26] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [10:15:26] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [10:15:27] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Marostegui) The following hosts are ready for this maintenance * pc1007 * labsdb1012 * db1116 * db1096 * dbproxy1013 * db1066. Note ** this host is powered OFF as it is ready to... 
[10:18:37] !log lvs1015: restart pybal to add new service kibana-ssl T210411 [10:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:21] !log lvs2003: restart pybal to add new service kibana-ssl T210411 [10:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:25] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [10:28:23] (03PS1) 10Ema: kibana: add discovery configuration [puppet] - 10https://gerrit.wikimedia.org/r/545225 [10:32:26] !log shutting down db1115 in preparation for PDU maintanance, this will make tendril and dbtree unavailable for 2 hours T227142 [10:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:30] T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 [10:32:54] (03PS2) 10Ema: kibana: add discovery configuration [puppet] - 10https://gerrit.wikimedia.org/r/545225 [10:35:51] (03CR) 10Ema: [C: 03+2] kibana: add discovery configuration [puppet] - 10https://gerrit.wikimedia.org/r/545225 (owner: 10Ema) [10:35:56] (03CR) 10Volans: [C: 03+1] "LGTM once the new discovery record is live." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/545167 (owner: 10Giuseppe Lavagetto) [10:37:47] (03CR) 10Filippo Giunchedi: "FWIW I'm still seeing some warnings from throttle from time to time, I'm guessing when we take down one of logstash frontends and more mes" [puppet] - 10https://gerrit.wikimedia.org/r/543904 (owner: 10Herron) [10:37:51] !log ema@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kibana [10:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:30] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 281 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [10:40:20] ^expected, downtimed, but I was too late it was aalready on soft [10:43:18] (03PS1) 10Ema: kibana: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/545232 (https://phabricator.wikimedia.org/T227432) [10:46:17] (03CR) 10Ema: [C: 03+2] kibana: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/545232 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:48:28] (03PS1) 10Filippo Giunchedi: logstash: remove deprecated elasticsearch options [puppet] - 10https://gerrit.wikimedia.org/r/545236 (https://phabricator.wikimedia.org/T235891) [10:48:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks fine (sans John's comment), but that changes existing sudo rules, so needs approval in the next SRE meeting, maybe Guillaume can tak" [puppet] - 10https://gerrit.wikimedia.org/r/544990 (owner: 10EBernhardson) [10:53:06] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10jcrespo) db1115 is now down, I took the opportunity to upgrade all its system packages, but didn't touch mariadb. 
[10:54:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:54:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:49] (03PS1) 10Ema: Revert "kibana: add discovery record" [dns] - 10https://gerrit.wikimedia.org/r/545241 [10:55:26] !log rebooting rpki2001 for some microcode tests [10:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:51] (03CR) 10Ema: [C: 03+2] Revert "kibana: add discovery record" [dns] - 10https://gerrit.wikimedia.org/r/545241 (owner: 10Ema) [10:58:47] (03CR) 10Huji: [C: 03+1] Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1100). [11:00:04] MatmaRex: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] hello [11:00:39] o/ [11:00:47] I can SWAT today! [11:01:04] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:01:58] MatmaRex: +2'ed your backports, will let you know once they're ready to test [11:02:17] Urbanecm: hi, thanks. 
i'm trying out the same patch we reverted yesterday, i've been told this time it will *really* work ;) [11:02:24] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:02:40] (03PS19) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [11:02:41] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Jclark-ctr) starting PDU Maintenance [11:02:44] (03PS1) 10Jcrespo: mariadb/backups: Prepare dbmonitor[12]001 to reimage to buster [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) [11:02:48] MatmaRex: cool!
[11:03:49] (03PS1) 10Muehlenhoff: Fix detection of virtual hosts in microcode code [puppet] - 10https://gerrit.wikimedia.org/r/545247 [11:04:31] (03CR) 10Muehlenhoff: mariadb/backups: Prepare dbmonitor[12]001 to reimage to buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [11:05:08] PROBLEM - Confd template for /var/lib/gdnsd/discovery-kibana.state on authdns1001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-kibana.state https://wikitech.wikimedia.org/wiki/Confd [11:05:25] (03CR) 10Jcrespo: [C: 04-1] mariadb/backups: Prepare dbmonitor[12]001 to reimage to buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [11:05:36] MatmaRex: fyi, I'm going to deploy a config patch while waiting on CI [11:05:55] okay [11:06:03] (03CR) 10Urbanecm: [C: 03+2] Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [11:06:36] (03PS2) 10Jcrespo: mariadb/backups: Prepare dbmonitor[12]001 to reimage to buster [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) [11:06:52] (03Merged) 10jenkins-bot: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [11:06:54] (03CR) 10Jcrespo: "Thank you very much for the catch, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [11:07:10] (03CR) 10Muehlenhoff: [C: 03+2] Fix detection of virtual hosts in microcode code [puppet] - 10https://gerrit.wikimedia.org/r/545247 (owner: 10Muehlenhoff) [11:09:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [11:09:57] (03CR) 10Effie Mouzeli: [C: 03+2] spec: remove hhvm references from tests [puppet] - 10https://gerrit.wikimedia.org/r/544847 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:09:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0593f34: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections (T230614) (duration: 00m 54s) [11:09:59] (03CR) 10Jcrespo: "@Marostegui From IRC: what would you think about me doing T224589 while you monitor the PDU stuff?" [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [11:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:03] T230614: Carry out the 2019 fawiki elections on votewiki - https://phabricator.wikimedia.org/T230614 [11:10:31] ^it is all bots all the way down [11:10:49] * jynus prepares when bots will take over my job [11:11:19] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10aborrero) [11:11:49] hope you have some pension plan :P [11:12:33] Step 1: automate my job away. Step 2: ?. Step 3: Profit! [11:13:08] jynus marostegui FYI with db1115 down/unavailable then /usr/local/sbin/mysqld_exporter_config.py doesn't work on prometheus hosts, noticed via puppet failures [11:13:18] pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1115.eqiad.wmnet' ([Errno 111] Connection refused)") [11:13:21] yeah [11:13:27] that is expected, but shouldn't alert? [11:13:36] does it create issues on prometheus? [11:14:08] oh, I see, it is the puppet thing [11:14:25] no immediate issue afaict no, but obviously the db targets are not being updated if needed [11:14:26] but it doesn't "break" puppet, right? 
[11:14:32] it does [11:14:36] oh? [11:14:41] I think that was on purpose [11:14:44] as in the puppet run fails, because the exec fails [11:14:45] to notice it [11:14:53] yeah I think this is the correct behaviour [11:14:55] but it doesn't blamk the config [11:15:01] is what I meant? [11:15:12] (03PS6) 10Urbanecm: Allow certain users to create account at closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117) [11:15:29] yeah the config is fine [11:15:38] I will ack the error with a timeout [11:15:49] It creates a warning so I didn't notice it [11:15:58] in the future we will have ha for that db [11:16:14] MatmaRex: your commits are ready at mwdebug1001, please test and let me know [11:16:15] but we cannot right now because tendril [11:17:11] ack, thanks jynus [11:17:26] sorry for the issues, I didn't rememeber that [11:17:37] I will document db1115 dependencies on wikitech [11:17:39] np, no issues actually [11:17:47] "it works" as intended [11:17:49] :-D [11:18:01] Urbanecm: looking [11:18:09] thanks [11:19:23] Urbanecm: are you sure? i'm still seeing the old JS code [11:19:39] MatmaRex: verifying, give me a moment [11:21:22] jouncebot: now [11:21:22] For the next 0 hour(s) and 38 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1100) [11:21:29] MatmaRex: made a mistake, it should be fine now [11:21:32] (03CR) 10Marostegui: [C: 03+1] mariadb/backups: Prepare dbmonitor[12]001 to reimage to buster [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [11:22:14] PROBLEM - Confd template for /var/lib/gdnsd/discovery-kibana.state on authdns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-kibana.state https://wikitech.wikimedia.org/wiki/Confd [11:22:40] Urbanecm: thanks, seems to be working as expected now. 
give me a minute more to test the logging stuff [11:22:47] MatmaRex: sure [11:24:28] godog: thanks for the ping, I didn't remember that either [11:25:51] Urbanecm: everything looks good! [11:26:03] MatmaRex: good! Going to sync 'em all! [11:28:20] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.2/extensions/VisualEditor/: SWAT: 2bc4420 (T235707); 680a98b (T233320); d83265d (T234564) (duration: 00m 53s) [11:28:34] MatmaRex: synced! [11:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:35] T233320: VisualEditor <-> RESTBase communication and ETags - https://phabricator.wikimedia.org/T233320 [11:28:35] T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array - https://phabricator.wikimedia.org/T234564 [11:28:36] T235707: VE not successfully loading on pages with video or audio embeds: "TypeError: href is null" from ve.dm.MWInternalLinkAnnotation.js:141:2 - https://phabricator.wikimedia.org/T235707 [11:29:13] Urbanecm: thank you [11:29:16] (03PS1) 10Muehlenhoff: Check for both kvm/qemu in systemd-detec-virt [puppet] - 10https://gerrit.wikimedia.org/r/545254 [11:29:21] you're welcome [11:29:24] !log EU SWAT done [11:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:19] (03CR) 10jerkins-bot: [V: 04-1] Check for both kvm/qemu in systemd-detec-virt [puppet] - 10https://gerrit.wikimedia.org/r/545254 (owner: 10Muehlenhoff) [11:34:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311, db1105:3312 for firmware upgrade T235877', diff saved to https://phabricator.wikimedia.org/P9428 and previous config saved to /var/cache/conftool/dbconfig/20191022-113437-marostegui.json [11:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:42] T235877: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 [11:34:56] Daimona: can 
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/SocialProfile/+/545003/ be un-WIP? [11:35:06] I can re+2 after that [11:35:09] !log Stop MySQL on db1105:3311, db1105:3312 for firmware upgrade - T235877 [11:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:30] hauskater: Oh, sure. I forgot to. [11:35:54] +2 [11:35:57] again [11:36:36] PROBLEM - Confd template for /var/lib/gdnsd/discovery-kibana.state on multatuli is CRITICAL: File not found: /var/lib/gdnsd/discovery-kibana.state https://wikitech.wikimedia.org/wiki/Confd [11:45:50] 10Operations, 10Gerrit: Editing in Gerrit isn't saved after the update/migration to gerrit1001 - https://phabricator.wikimedia.org/T236143 (10MoritzMuehlenhoff) [11:47:17] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10MoritzMuehlenhoff) Nothing critical, but this happens after the update/migration: https://phabricator.wiki... 
[11:48:14] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) a:03jcrespo [11:48:17] (03PS2) 10Muehlenhoff: Check for both kvm/qemu in systemd-detec-virt [puppet] - 10https://gerrit.wikimedia.org/r/545254 [11:51:01] (03PS3) 10Jcrespo: mariadb/backups: Prepare dbmonitor[12]001 to reimage to buster [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) [11:51:02] !log Restarted CI Jenkins on contint1001 [11:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: rename k8s::apilb role/profile to k8s::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/544191 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [11:52:21] (03CR) 10Jcrespo: [C: 03+2] mariadb/backups: Prepare dbmonitor[12]001 to reimage to buster [puppet] - 10https://gerrit.wikimedia.org/r/545246 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [11:54:21] (03PS3) 10Muehlenhoff: Check for both kvm/qemu in systemd-detec-virt [puppet] - 10https://gerrit.wikimedia.org/r/545254 [11:57:17] !log starting to cut branch for train 1.35-wmf.3 [11:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1200) [12:02:00] 10Operations, 10MediaWiki-Maintenance-scripts, 10cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (10Marostegui) [12:12:23] (03PS1) 10Awight: Reference Previews: full beta deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545260 (https://phabricator.wikimedia.org/T235083) [12:14:59] !log reimage to buster dbmonitor2001.wikimedia.org T224589 [12:15:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:03] T224589: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 [12:15:20] (03PS1) 10Paladox: Test inline editor [puppet] - 10https://gerrit.wikimedia.org/r/545262 [12:15:48] (03PS2) 10Paladox: Test inline editor [puppet] - 10https://gerrit.wikimedia.org/r/545262 [12:16:31] (03Abandoned) 10Paladox: Test inline editor [puppet] - 10https://gerrit.wikimedia.org/r/545262 (owner: 10Paladox) [12:19:36] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Cmjohnson) @wiki_willy can you order a new disk? I tried logging in but being prompted for a password so I cannot get disk info. [12:19:49] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Cmjohnson) a:05Cmjohnson→03wiki_willy [12:20:35] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 (10Cmjohnson) Updated all F/W on db1105 - Raid -Bios - Backplane - Idrac [12:20:53] (03PS1) 10Marostegui: Revert "db-eqiad.php: Temporary pool pc1010 in pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545264 [12:22:10] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 (10Marostegui) Thank you Chris! 
[12:22:26] RECOVERY - MariaDB Slave IO: pc1 on pc2010 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [12:22:49] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [12:22:56] RECOVERY - MariaDB Slave IO: pc1 on pc2007 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [12:23:24] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Jclark-ctr) a:05Cmjohnson→03RobH finished PDU Maintenance . Netbox updated with new PDU [12:23:43] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Temporary pool pc1010 in pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545264 (owner: 10Marostegui) [12:24:32] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Temporary pool pc1010 in pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545264 (owner: 10Marostegui) [12:25:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1007 after PDU maintenance T227142 (duration: 00m 50s) [12:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:01] T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 [12:27:38] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:27:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:59] !log Compress db1096:3315 [12:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:40] !log rebooting miscweb2001 for some microcode tests [12:29:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2089:3315', diff saved to https://phabricator.wikimedia.org/P9429 and previous config saved to /var/cache/conftool/dbconfig/20191022-123032-marostegui.json [12:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3312 and db1105:3311 after on-site maintenance T235877', diff saved to https://phabricator.wikimedia.org/P9430 and previous config saved to /var/cache/conftool/dbconfig/20191022-123257-marostegui.json [12:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:03] T235877: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 [12:33:50] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 7.304 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [12:37:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1096:3316 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9431 and previous config saved to /var/cache/conftool/dbconfig/20191022-123757-marostegui.json [12:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:00] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:43:36] (03PS1) 10Muehlenhoff: Extend black list for unsupported CPUs [puppet] - 10https://gerrit.wikimedia.org/r/545267 [12:44:57] 10Operations, 10MediaWiki-extensions-PdfHandler, 10Multimedia: Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007 (10Seb35) 05Resolved→03Open I reopen this task with a better proposed resolution, and possibly al... 
[12:45:25] (03PS1) 10Cmjohnson: Adding mgmt dns for dumpsdata1003 [dns] - 10https://gerrit.wikimedia.org/r/545268 (https://phabricator.wikimedia.org/T234076) [12:46:01] 10Operations, 10Traffic, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10jijiki) It appears we are having fetch errors, possibly due to timeouts as well mostly on two servers where we have enabled... [12:46:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1096:3316 db1105:3311 instance db1105:3312 after PDU and on-site maintenance', diff saved to https://phabricator.wikimedia.org/P9432 and previous config saved to /var/cache/conftool/dbconfig/20191022-124607-marostegui.json [12:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:05] (03CR) 10Muehlenhoff: [C: 03+2] Extend black list for unsupported CPUs [puppet] - 10https://gerrit.wikimedia.org/r/545267 (owner: 10Muehlenhoff) [12:54:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1096:3316 db1105:3311 instance db1105:3312 after PDU and on-site maintenance', diff saved to https://phabricator.wikimedia.org/P9433 and previous config saved to /var/cache/conftool/dbconfig/20191022-125435-marostegui.json [12:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:12] (03PS1) 10Ayounsi: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545270 [13:00:04] liw and brennen: Time to snap out of that daydream and deploy Mediawiki train - European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1300). 
[13:01:52] branch cutting is still running [13:04:13] (03PS2) 10CDanis: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545270 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [13:04:21] (03CR) 10Ema: [C: 03+1] Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545270 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [13:05:26] 10Operations, 10Traffic, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) I highly suspect that's related to stricter timeouts on ats-be compared to varnish-be and atls-tls, that would... [13:05:36] (03CR) 10Ayounsi: [C: 03+2] Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545270 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [13:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1096:3316 db1105:3311 db1105:3312 after PDU and on-site maintenance', diff saved to https://phabricator.wikimedia.org/P9434 and previous config saved to /var/cache/conftool/dbconfig/20191022-130556-marostegui.json [13:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:13] !log depool esams for onsite work - T235805 [13:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:19] T235805: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 [13:06:32] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [13:06:34] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 (10Marostegui) 05Open→03Resolved Host fully repooled in production. Thanks Chris! 
[13:10:36] (03PS1) 10Jcrespo: dbmonitor: Install the right apache module packages for >jessie [puppet] - 10https://gerrit.wikimedia.org/r/545273 (https://phabricator.wikimedia.org/T224589) [13:13:41] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime [13:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:55] 10Operations, 10ops-esams, 10DC-Ops: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by ayounsi@cumin1001 on 28 host(s) and their services with reason: Onsite work (asw) ` bast3002.wikimedia.org,cp[3007-3008,3010,3030,303... [13:15:16] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 53.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:18:41] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545273 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [13:24:09] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [13:24:12] (03CR) 10Jcrespo: [C: 03+2] dbmonitor: Install the right apache module packages for >jessie [puppet] - 10https://gerrit.wikimedia.org/r/545273 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [13:24:35] womp womp kartotherian [13:24:41] ^ looking at kartotherian [13:24:52] thanks gehel ! 
[13:25:02] kartotherian [13:25:05] hmm [13:25:09] all maps machines in eqiad has been pegged on CPU for ~20 minutes [13:25:14] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:25:20] network traffic increased as well [13:25:56] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:25:57] tile generation is at its usual rate [13:26:26] load is super high [13:26:46] (03PS1) 10Lars Wirzenius: Group0 to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545281 [13:26:51] seems our friend is back [13:27:02] PROBLEM - Maps HTTPS on maps1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:27:58] lot of errors similar to `[2019-10-22T13:27:24.731Z] ERROR: kartotherian/12073 on maps1003: Bad geojson - unknown type object (err.levelPath=error) [13:27:58] Error: Bad geojson - unknown type object [13:27:58] ` [13:28:30] !log liw@deploy1001 Started scap: testwiki to php-1.34.0-wmf.3 and rebuild l10n cache [13:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:34] RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 6.199 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:29:57] (03CR) 10Ottomata: "If we do put it on a Ganeti VM, Could/should search use their own MySQL instance there instead of the analytics-meta one?" 
[puppet] - 10https://gerrit.wikimedia.org/r/544989 (owner: 10EBernhardson) [13:30:48] I seem to be running late with the group0 deployment; should I overrun the train time slot or pause and continue later? [13:30:54] thcipriani, ^ [13:31:26] (03CR) 10Ottomata: cumin: update which server is the kafka-main canary (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545094 (owner: 10Dzahn) [13:31:38] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:32:06] hitrate slowly recovering https://grafana.wikimedia.org/d/000000500/varnish-caching?refresh=15m&orgId=1&from=now-3h&to=now&var-cluster=cache_text&var-cluster=cache_upload&var-site=codfw&var-site=eqiad&var-site=ulsfo&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [13:32:07] liw: I would overrun the timeslot. Especially since nothing is scheduled for the next 2 hours or so [13:32:53] ack, thanks [13:32:55] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [13:33:30] PROBLEM - Maps HTTPS on maps1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:33:58] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:34:49] ignore the esams availability alert, the DC is depooled [13:35:10] !log liw@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" --outdir="/tmp/scap_l10n_2419219323" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 06m 40s) 
[13:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:18] ema: would be nice if the check could check for depooled DCs [13:35:23] 10Operations, 10Performance-Team, 10serviceops: Increased latency in POST requests - https://phabricator.wikimedia.org/T235755 (10Gilles) https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-30d&to=now {F30875423, size=full} Is this still an issue? The above distribution of back... [13:35:32] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) 2 blockers: * Exec of `/usr/sbin/a2enmod php7.0` fails, as ther right module would be php7.3- No support for buster on the http module? `Httpd/Httpd::Mod_conf[php7.0]/Exec[ensure... [13:36:38] RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 5.853 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:36:58] again, we don't have much more traffic than usual [13:37:10] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:37:12] so probably again some specific queries going wild [13:37:48] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [13:38:15] hm, scap synced testwiki version change, but https://test.wikipedia.org/wiki/Special:Version still shows -wmf.2, not -wmf.3 [13:38:17] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10Joe) >>! 
In T224589#5595135, @jcrespo wrote: > 2 blockers: > > * Exec of `/usr/sbin/a2enmod php7.0` fails, as ther right module would be php7.3- No support for buster on the http module?... [13:38:30] thcipriani, hashar, help? [13:38:47] liw: there was a failed log above ~3m ago [13:38:56] !log silencing LVS check for katotherian (we know there is an issue) - T236163 [13:38:56] 10Operations, 10Maps: Maps servers overloaded in eqiad - https://phabricator.wikimedia.org/T236163 (10Gehel) [13:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:59] T236163: Maps servers overloaded in eqiad - https://phabricator.wikimedia.org/T236163 [13:39:12] dang, I should not have /ignored so much... [13:39:19] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [13:39:23] volans, can you copy it for me? [13:39:25] deploy1001:/srv/mediawiki-staging$ grep '"testwiki"' wikiversions.json [13:39:25] "testwiki": "php-1.35.0-wmf.2", [13:39:46] <_joe_> gehel: to be frank, either some engineering time is spent on maps, or I will disable paging [13:39:47] $ grep wmf.3 wikiversions.json [13:39:47] "labtestwiki": "php-1.35.0-wmf.3", [13:39:53] so you changed wikitech :D [13:40:01] _joe_: I agree! [13:40:04] <_joe_> I think the last time someone invested engineering time on maps was... the last outage [13:40:10] <_joe_> us that time too [13:40:22] well not wikitech sorry [13:40:31] liw: you changed the wrong wiki "labstestwiki" instead of "testwiki" :] [13:40:38] hashar, oh crap [13:40:56] not a big deal [13:40:57] hashar, I fix wikiversions and re-run? 
[13:41:00] so yeah fix it [13:41:07] then I think it is scap sync-wikiversion [13:41:18] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet, maps1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:42:00] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:42:10] <_joe_> gehel: ok lemme do it in a few then [13:42:10] 10Operations, 10Maps: Maps servers overloaded in eqiad - https://phabricator.wikimedia.org/T236163 (10Gehel) [13:42:57] _joe_: we probably don't have a good way to selectively silence some of the pybal checks... [13:43:07] (03PS1) 10Jcrespo: dbmonitor: Deploy git repo as mwdeploy, otherwise no write permission [puppet] - 10https://gerrit.wikimedia.org/r/545282 (https://phabricator.wikimedia.org/T224589) [13:43:16] <_joe_> gehel: we do [13:43:20] hashar, I run "scap sync-wikiversions" not "scap sync..." as the train page says? 
[13:44:30] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:45:58] thcipriani, ^ see q for h.ashar [13:46:29] liw: /me looks [13:46:48] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:46:58] (03PS1) 10Giuseppe Lavagetto: lvs: do not page on karthoterian unavailability [puppet] - 10https://gerrit.wikimedia.org/r/545285 [13:47:06] <_joe_> gehel: ^^ [13:47:36] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto) [13:47:48] that's probably because esams is depooled [13:48:24] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:48:31] liw: oh, I see scap failed. Which server did it fail on? 
[13:48:53] ACKNOWLEDGEMENT - Confd template for /var/lib/gdnsd/discovery-kibana.state on authdns1001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-kibana.state Ema Known, to be fixed after esams repool https://wikitech.wikimedia.org/wiki/Confd [13:48:53] ACKNOWLEDGEMENT - Confd template for /var/lib/gdnsd/discovery-kibana.state on authdns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-kibana.state Ema Known, to be fixed after esams repool https://wikitech.wikimedia.org/wiki/Confd [13:49:23] (03CR) 10CDanis: [C: 03+1] lvs: do not page on karthoterian unavailability [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto) [13:49:58] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:50:08] thcipriani, I don't know, I had the bots on ignore :( (unignored now, but channels is difficult to follow) [13:50:47] accordig to turnilo, there does not seem to be a specific URL that timeouts: [13:50:47] https://turnilo.wikimedia.org/#webrequest_sampled_128/4/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ASnGwBzYkryqQUNLXoEATAAYAjACcpH4+pF5eIj4A7Hg+PrE+AHRxPgBaksTYACbcvoHBoeEifgDMCQnJcekAvgC61QxqWjquaDQQ9pKGxgQw7SaUmG6ScOQYONwdkmCIMI4qICxwmlCJAO4QANYQbFkQcImYNHYgtUzYmJ5SiFDEDU3a3G7tnQZG3JRoaJombugwSiYo3GuAIUyYMwQcycyhAABYvAFTudLvhrghbnUmFBN [13:50:47] 
Eg0DCHi1nh0Tkw9mxsFAsKCQH0ICZNOhKJIoEdPKBuh8IPjLM0ngoIPMyRBDGNqdwso5yJk9q8QNp2pgcgQQA1CDtufhsDAEAh7sw+bodvoQOTMlSJgQzBYmHYaLYdbRuepuAAFEQAVgAsiy2fgOe8reZjbzHgQzZTxcLRSDuHAoNLsiTVUwkCxNXhtbqsa4BfNnAGpApMtKuTymFIjkt2Qaw6ajHAdfQIbMWinay02PG+lwc5oOiQsgARY2RnAws7ygfELIAZT9BEo3MBhGIDmy/tNo4tNLpDKZkjTGY9Lah+azCCY1CgADkdQg0Tc7leIHZKEg354L9UgA= [13:50:56] ofc, crappy urls [13:51:08] liw: should say in your console [13:51:35] thcipriani, which console were? [13:51:55] gehel: https://w.wiki is good for this [13:52:24] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:52:31] accordig to turnilo, there does not seem to be a specific URL that timeouts: https://w.wiki/AaP [13:52:35] !log restarted slapd on ldap-eqiad-replica01 [13:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:42] cdanis: thx! [13:52:52] gehel: also wasn't the page after 13:00? [13:53:12] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:53:19] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [13:53:37] cdanis: right! 
data not yet in turnilo it seems [13:54:00] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:54:03] yeah the webrequest table has an hour or two delay [13:54:46] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:55:02] looks like a bunch of requests similar to https://maps.wikimedia.org/osm-liber/%7Bz%7D/%7Bx%7D/%7By%7D.png are taking more time than expected [13:56:45] looks like a client failing to process its placeholders [13:59:07] so we fixed something similar before [13:59:11] with some regex filtering [13:59:34] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:59:49] bblack: yep [14:00:36] what's new? [14:00:56] looks like mateusbs17 has a fix ready to be deployed [14:01:17] which pushes back the input validation in kartotherian [14:01:36] and hopefully is more exhaustive than what we currently have in varnish [14:01:43] (03PS1) 10Jcrespo: dbmonitor: Install the right apache modules for buster [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) [14:04:02] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) Thanks, joe, I didn't see your comment so it tool me more time than I thought to find it. The above 2 patches should fix it? 
[14:04:19] thcipriani, thanks for the help [14:04:33] (03PS1) 10Ema: kibana: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/545287 (https://phabricator.wikimedia.org/T227432) [14:04:35] https://phabricator.wikimedia.org/T236166 - reported, train can't continue to group0 before this is fixed [14:05:01] (03PS1) 10BBlack: Move GeoDNS default from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/545288 (https://phabricator.wikimedia.org/T235805) [14:05:21] liw: happy to help. it's a strange issue [14:06:00] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:06:26] (03CR) 10BBlack: [C: 03+2] Move GeoDNS default from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/545288 (https://phabricator.wikimedia.org/T235805) (owner: 10BBlack) [14:06:30] gehel: are things under control? Waiting to do my maintenance [14:06:54] XioNoX: not entirely under control yet, but we have a good idea of how to fix it [14:07:36] your maintenance will probably not make things worse, but the noise from maps might hide some other problems [14:07:36] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:08:19] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [14:08:30] thinking of the opposite, noise from my esams maintenance might flood irc [14:08:37] (03PS1) 10Jbond: CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) [14:09:12] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:09:39] (03CR) 10jerkins-bot: [V: 04-1] CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [14:09:41] (03CR) 10BBlack: [C: 03+1] kibana: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/545287 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:10:50] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:11:21] XioNoX: we should have a patch deployed in ~15 minutes [14:11:31] ok, thx [14:11:42] XioNoX: don't let maps stop you, you can't make the situation worse :) [14:12:26] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:13:34] !log restart asw-esams for onsite work [14:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:53] (03PS2) 10Jbond: CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) [14:14:01] I'll silence the availability alerts for esams [14:16:12] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:16:19] (03CR) 10jerkins-bot: [V: 04-1] CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [14:16:34] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:17:54] PROBLEM - Host lvs3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:17:54] PROBLEM - Host lvs3002.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [14:18:04] PROBLEM - Host lvs3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:18:04] PROBLEM - Host lvs3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:18:06] PROBLEM - Host maerlant.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:18:16] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:18:20] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:50] PROBLEM - Host cp3040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:19:04] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 39, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:19:20] PROBLEM - Host multatuli.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:19:28] PROBLEM - Host cp3010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:19:28] PROBLEM - Host cp3032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:19:48] PROBLEM - Host nescio.mgmt is DOWN: CRITICAL - Time to live exceeded (10.21.0.111) [14:20:44] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:20:46] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:21:08] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:22:53] (03PS3) 10Jbond: CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) [14:23:36] RECOVERY - Host lvs3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.09 ms [14:23:36] RECOVERY - Host lvs3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.48 ms [14:23:46] XioNoX: {done} but I guess slightly too late [14:23:46] RECOVERY - Host lvs3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.46 ms [14:23:46] RECOVERY - Host lvs3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.82 ms [14:23:48] RECOVERY - Host maerlant.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.67 ms [14:24:32] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 29.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:24:32] RECOVERY - Host cp3040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.31 ms [14:24:54] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:25:00] RECOVERY - Host multatuli.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.24 ms [14:25:08] RECOVERY - Host cp3010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 
84.05 ms [14:25:08] RECOVERY - Host cp3032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.08 ms [14:25:09] volans: nop, will need a restart soon ish again [14:25:22] PROBLEM - BFD status on cr2-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:25:28] RECOVERY - Host nescio.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.65 ms [14:25:30] ok they are downtimed for 2h, lmk if you need more XioNoX [14:26:17] thx [14:29:40] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:29:44] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:31:16] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:20] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10sbassett) a:05sbassett→03chasemp [14:33:34] (03PS1) 10BBlack: Move most North American traffic westwards [dns] - 10https://gerrit.wikimedia.org/r/545294 (https://phabricator.wikimedia.org/T235805) [14:34:08] (03CR) 10Jcrespo: "Ignore the mysql package, that is supposed to be deleted as soon as it goes unused: https://phabricator.wikimedia.org/T162070" [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [14:35:56] (03CR) 10BBlack: [C: 03+2] Move most North American traffic westwards [dns] - 10https://gerrit.wikimedia.org/r/545294 (https://phabricator.wikimedia.org/T235805) (owner: 10BBlack) [14:37:40] (03CR) 10Jbond: [C: 03+1] Extend wmf-userschema for additional MFA options: [puppet] - 10https://gerrit.wikimedia.org/r/543402 (owner: 10Muehlenhoff) 
[14:39:26] (03PS1) 10Ayounsi: Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545298 [14:39:54] RECOVERY - BFD status on cr2-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:40:28] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Halfak) Yes he is staff. He'll be pulling data from MariaDB for use training ORES models. [14:42:07] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline. Also tests should be included for sth this critical IMHO" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:42:59] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@85ea6e1]: Deploy kartotherian 1.1.5-wmf.0 [14:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:09] gehel ^ [14:43:31] mateusbs17: kool, let's see if the load goes down... [14:45:41] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@85ea6e1]: Deploy kartotherian 1.1.5-wmf.0 (duration: 02m 44s) [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:14] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:48:48] the blocker got downgraded, continuing to deploy to group0 [14:48:50] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:50:10] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:50:35] !log liw@deploy1001 Started scap: testwiki to php-1.35.0-wmf.3 and rebuild l10n cache [14:50:38] (03PS3) 10Mforns: analytics::refinery::job::data_purge: Add timer to delete old MWH dumps [puppet] - 10https://gerrit.wikimedia.org/r/539151 (https://phabricator.wikimedia.org/T208612) [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:36] !log stopping puppet and pybal on lvs1014 (upload+maps traffic to 1016) [14:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:08] (03CR) 10Jcrespo: "> when a failed backup is detected log it" [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:55:47] new kartotherian version deployed, load on maps servers is going down [14:56:37] 10Operations, 10MediaWiki-Maintenance-scripts, 10cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (10Andrew) This problem went away after ` bblack> Brandon Black !log stopping puppet and pybal on lvs1014 (upload+maps traffic to 1016) ` [14:57:28] PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [14:58:10] PROBLEM - pybal on lvs1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal 
https://wikitech.wikimedia.org/wiki/PyBal [14:58:54] ^ known, from my log above [15:01:22] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:01:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:00] PROBLEM - PyBal connections to etcd on lvs1014 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [15:02:13] 10Operations, 10MediaWiki-Maintenance-scripts, 10cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (10Andrew) I anticipate that this will be resolved by https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/543664/ [15:03:49] !log rebooting kafka-main1005 for microcode debugging [15:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:57] scap tells me: 15:04:12 Check 'Logstash Error rate for mw1263.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00) [15:07:57] <_joe_> liw please stop deploying anything [15:08:03] <_joe_> moritzm: likewise, please [15:08:30] _joe_, scap sync to testwiki is running, shall I abort? [15:09:04] <_joe_> liw: no of course but that seems like it's rolling back already given you had a failure surge? 
[15:09:12] 10Operations, 10Maps: Maps servers overloaded in eqiad - https://phabricator.wikimedia.org/T236163 (10MSantos) 05Open→03Resolved a:03MSantos Deployed new version of kartotherian fixed it https://gerrit.wikimedia.org/r/c/maps/kartotherian/deploy/+/545299 [15:09:35] _joe_, yeah [15:09:42] (03CR) 10Marostegui: [C: 03+1] dbmonitor: Deploy git repo as mwdeploy, otherwise no write permission [puppet] - 10https://gerrit.wikimedia.org/r/545282 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [15:09:57] <_joe_> ok so I was asking to not do further deployments for now [15:10:07] (03CR) 10Marostegui: [C: 03+1] dbmonitor: Install the right apache modules for buster [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [15:10:28] (03PS2) 10CRusnov: coherence: Check unracked devices for connected console ports [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/545132 [15:10:30] !log re-enabling lvs1014 pybal/puppet [15:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:54] RECOVERY - pybal on lvs1014 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:11:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [15:11:50] RECOVERY - PyBal backends health check on lvs1014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:13:10] RECOVERY - PyBal connections to etcd on lvs1014 is OK: OK: 24 connections established with conf1004.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [15:13:34] !log re-disabling lvs1014 ... 
[15:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:20] PROBLEM - pybal on lvs1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:18:18] PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [15:20:32] !log rollback ns2 redirect [15:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:10] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Nuria) Ok, approved for wmf, analytics-privatedata-users, statistics-privatedata-users on my end [15:22:00] PROBLEM - PyBal connections to etcd on lvs1014 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [15:24:24] (03CR) 10ArielGlenn: "Woo hoo! 
:)" [dns] - 10https://gerrit.wikimedia.org/r/545268 (https://phabricator.wikimedia.org/T234076) (owner: 10Cmjohnson) [15:24:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545298 (owner: 10Ayounsi) [15:25:05] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545298 (owner: 10Ayounsi) [15:25:09] (03PS2) 10Ayounsi: Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545298 [15:26:30] !log repool esams [15:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:14] !log liw@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.3 and rebuild l10n cache (duration: 37m 39s) [15:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:34] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 49.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:40:07] !log rebooting lvs1014 [15:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:22] PROBLEM - Host lvs1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:28] RECOVERY - Host lvs1014 is UP: PING WARNING - Packet loss = 64%, RTA = 0.33 ms [15:45:42] PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [15:46:22] PROBLEM - pybal on lvs1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:47:29] !log enable pybal+puppet on rebooted lvs1014 [15:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] RECOVERY - pybal on lvs1014 is OK: PROCS OK: 1 process with UID = 0 
(root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:48:52] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10RStallman-legalteam) NDA is signed and on file. Thanks! [15:48:56] RECOVERY - PyBal backends health check on lvs1014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:50:02] RECOVERY - PyBal connections to etcd on lvs1014 is OK: OK: 24 connections established with conf1004.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [15:52:53] 10Operations, 10Gerrit: Editing in Gerrit isn't saved after the update/migration to gerrit1001 - https://phabricator.wikimedia.org/T236143 (10MoritzMuehlenhoff) This happened earlier the day, but I cannot currently reproduce it with a freshly created patch. [15:57:35] (03PS1) 10BBlack: Depool esams to test lvs1014 state [dns] - 10https://gerrit.wikimedia.org/r/545312 [15:58:04] (03CR) 10BBlack: [C: 03+2] Depool esams to test lvs1014 state [dns] - 10https://gerrit.wikimedia.org/r/545312 (owner: 10BBlack) [15:58:32] !log depooling esams temporarily to test traffic scenario on lvs1014 [15:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:49] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) Thank you! @lexnasser: please ping @Dzahn with your e-mail address/user password for wikitech [16:00:05] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:05:04] PROBLEM - Confd template for /var/lib/gdnsd/discovery-kibana.state on multatuli is CRITICAL: File not found: /var/lib/gdnsd/discovery-kibana.state https://wikitech.wikimedia.org/wiki/Confd [16:05:38] (03CR) 10Jcrespo: dbmonitor: Install the right apache modules for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [16:07:12] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 100 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:07:32] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/545132 (owner: 10CRusnov) [16:10:12] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:10:33] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10wiki_willy) [16:10:36] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 38.49 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:11:14] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10wiki_willy) Procurement task created for Rob to order replacement drive. 
Thanks, Willy [16:13:24] We are going to stop gerrit [16:13:26] jouncebot: now [16:13:26] For the next 0 hour(s) and 46 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1600) [16:14:04] !log stopping gerrit to run a fix for T222391 [16:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:10] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [16:15:17] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [16:17:52] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:16] PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit [16:18:20] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1529 too small - 1529 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [16:18:26] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:54] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1529 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [16:18:56] PROBLEM - Check systemd state on deploy2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:10] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:26] PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [16:19:44] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:18] ACKNOWLEDGEMENT - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:18] ACKNOWLEDGEMENT - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Gerrit [16:20:18] ACKNOWLEDGEMENT - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Gerrit [16:20:52] !log restarting gerrit [16:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:04] RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [16:21:06] RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:30] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_2.15.14-16-g855b179b5f (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [16:21:32] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27019 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [16:21:34] !log running puppet on deployment servers [16:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:40] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:06] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 864 bytes in 0.044 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [16:22:10] RECOVERY - Check systemd state on deploy2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:22] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:58] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:12] (03PS2) 10Dzahn: cumin: update which server is the kafka-main canary [puppet] - 10https://gerrit.wikimedia.org/r/545094 [16:25:25] (03CR) 10Dzahn: cumin: update which server is the kafka-main canary (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545094 (owner: 10Dzahn) [16:28:18] (03CR) 10Dzahn: [C: 03+2] mariadb/ferm_misc: allow moscovium to connect to rt database [puppet] - 10https://gerrit.wikimedia.org/r/544079 (https://phabricator.wikimedia.org/T180641) 
(owner: 10Dzahn) [16:28:26] (03PS3) 10Dzahn: mariadb/ferm_misc: allow moscovium to connect to rt database [puppet] - 10https://gerrit.wikimedia.org/r/544079 (https://phabricator.wikimedia.org/T180641) [16:40:25] (03PS1) 10Dzahn: site: turn cobalt into a spare system (Do not merge) [puppet] - 10https://gerrit.wikimedia.org/r/545328 [16:40:28] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10RobH) 05Open→03Resolved a:05RobH→03None [16:40:31] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [16:40:44] 10Operations, 10ops-codfw, 10SRE-swift-storage, 10User-fgiunchedi: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10fgiunchedi) 05Open→03Resolved This is completed, hosts are fully in service now. [16:42:35] 10Operations, 10serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) [16:42:44] 10Operations, 10serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) 05Open→03Stalled p:05Triage→03Normal [16:42:50] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10RobH) a:05RobH→03Jclark-ctr My understanding of this task state is as follows: * @Jclark-ctr had to emergency swap out ps1-a1-eqiad due to a failure * he left the old ps2-a1... 
[16:43:10] (03PS2) 10Dzahn: site: turn cobalt into a spare system (Do not merge) [puppet] - 10https://gerrit.wikimedia.org/r/545328 (https://phabricator.wikimedia.org/T236187) [16:43:24] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) @RobH - ps2 was swapped last Tuesday on 10/15 [16:44:07] (03PS1) 10Dzahn: ci: remove cobalt from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/545330 (https://phabricator.wikimedia.org/T236187) [16:47:49] (03PS1) 10Dzahn: mariadb: remove cobalt from ferm_misc rules [puppet] - 10https://gerrit.wikimedia.org/r/545333 (https://phabricator.wikimedia.org/T236187) [16:47:51] (03PS1) 10Dzahn: acme_chief: remove cobalt from authorized hosts [puppet] - 10https://gerrit.wikimedia.org/r/545334 (https://phabricator.wikimedia.org/T236187) [16:47:53] (03PS1) 10Dzahn: gerrit: remove cobalt from ssh known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/545335 (https://phabricator.wikimedia.org/T236187) [16:47:55] (03PS1) 10Dzahn: install_server: remove cobalt from DHCP and partman [puppet] - 10https://gerrit.wikimedia.org/r/545336 (https://phabricator.wikimedia.org/T236187) [16:48:56] (03PS1) 10BBlack: Revert "Move most North American traffic westwards" [dns] - 10https://gerrit.wikimedia.org/r/545338 [16:48:58] (03PS1) 10BBlack: Revert "Move GeoDNS default from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/545339 [16:49:00] (03PS1) 10BBlack: Revert "Depool esams to test lvs1014 state" [dns] - 10https://gerrit.wikimedia.org/r/545340 [16:49:02] (03PS1) 10RobH: setting new pdu models [puppet] - 10https://gerrit.wikimedia.org/r/545337 (https://phabricator.wikimedia.org/T227142) [16:50:03] (03CR) 10jerkins-bot: [V: 04-1] setting new pdu models [puppet] - 10https://gerrit.wikimedia.org/r/545337 (https://phabricator.wikimedia.org/T227142) (owner: 10RobH) [16:50:43] (03CR) 10BBlack: [C: 03+2] Revert "Move most North American traffic 
westwards" [dns] - 10https://gerrit.wikimedia.org/r/545338 (owner: 10BBlack) [16:50:49] (03CR) 10BBlack: [C: 03+2] Revert "Move GeoDNS default from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/545339 (owner: 10BBlack) [16:51:40] !log geodns: moving all "normal" eqiad traffic back to eqiad (in addition to the esams-diverted traffic which is still pointed mostly at eqiad right now) [16:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:34] (03PS1) 10Dzahn: gerrit: change gerrit master_host to gerrit1001, remove duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545342 (https://phabricator.wikimedia.org/T222391) [16:55:17] (03PS2) 10RobH: setting new pdu models [puppet] - 10https://gerrit.wikimedia.org/r/545337 (https://phabricator.wikimedia.org/T227142) [16:55:23] ACKNOWLEDGEMENT - SSH mw1290.mgmt on mw1290.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T234153 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:56:05] (03CR) 10RobH: [C: 03+2] setting new pdu models [puppet] - 10https://gerrit.wikimedia.org/r/545337 (https://phabricator.wikimedia.org/T227142) (owner: 10RobH) [16:56:53] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10RobH) 05Open→03Resolved Ok, just logged in and confirmed the ps1 sees ps2. the rest was already configured from our deployment of ps1 except the model hadn't been updated.... 
[16:56:55] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [16:57:09] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10RobH) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1700). [17:01:34] 10Operations, 10DC-Ops, 10serviceops: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (10Dzahn) [17:03:04] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 45.35 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:03:32] 10Operations, 10DC-Ops, 10serviceops: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (10Dzahn) nothing special in SEL ` /admin1-> racadm getsel Record: 1 Date/Time: 11/12/2014 09:37:12 Source: system Severity: Ok Description: Log cleared. --------------... [17:04:03] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1252 is CRITICAL: 4 ge 4 daniel_zahn https://phabricator.wikimedia.org/T236190 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1252&var-datasource=eqiad+prometheus/ops [17:04:56] 10Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287 (10Dzahn) please add mw1252 to the list (T236190) [17:05:39] 10Operations, 10DC-Ops, 10serviceops: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (10Dzahn) Support expiry date Nov. 
15, 2017 so i guess we won't fix these anymore. [17:09:22] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10RobH) [17:10:37] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10RobH) a:05RobH→03Jclark-ctr @wiki_willy requested I step in and setup the software side of things, but cannot do so as serial to this PDU isn't currently working. Can you tr... [17:13:19] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [17:14:52] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.81 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:15:55] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10RobH) the icinga downtime was set to expire in less than an hour, so I've extended it until 2300 GMT. 
[17:17:23] RECOVERY - ps1-a1-eqiad-infeed-load-tower-A-phase-Z on ps1-a1-eqiad is OK: SNMP OK - ps1-a1-eqiad-infeed-load-tower-A-phase-Z 421 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:55] RECOVERY - ps1-a1-eqiad-infeed-load-tower-B-phase-X on ps1-a1-eqiad is OK: SNMP OK - ps1-a1-eqiad-infeed-load-tower-B-phase-X 150 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:57] RECOVERY - ps1-a1-eqiad-infeed-load-tower-B-phase-Y on ps1-a1-eqiad is OK: SNMP OK - ps1-a1-eqiad-infeed-load-tower-B-phase-Y 355 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:15] !log arlolra@deploy1001 Started deploy [parsoid/deploy@4c64c9c]: Updating Parsoid to cf01d91 [17:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:59] (03CR) 10BBlack: [C: 03+2] Revert "Depool esams to test lvs1014 state" [dns] - 10https://gerrit.wikimedia.org/r/545340 (owner: 10BBlack) [17:20:43] !log geodns: re-pooling esams (at this point, we're entirely back in our "normal" state of affairs) [17:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:49] (03PS8) 10Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542658 [17:23:55] (03CR) 10Krinkle: "Rebased to resolve composer.lock conflict." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/542658 (owner: 10Krinkle) [17:26:52] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@4c64c9c]: Updating Parsoid to cf01d91 (duration: 07m 37s) [17:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:39] RECOVERY - ps1-a1-eqiad-infeed-load-tower-A-phase-X on ps1-a1-eqiad is OK: SNMP OK - ps1-a1-eqiad-infeed-load-tower-A-phase-X 95 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:33:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [17:33:37] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 58.98 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:34:35] RECOVERY - ps1-a1-eqiad-infeed-load-tower-A-phase-Y on ps1-a1-eqiad is OK: SNMP OK - ps1-a1-eqiad-infeed-load-tower-A-phase-Y 296 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:36:07] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:36:27] RECOVERY - ps1-a1-eqiad-infeed-load-tower-B-phase-Z on ps1-a1-eqiad is OK: SNMP OK - ps1-a1-eqiad-infeed-load-tower-B-phase-Z 373 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:36:35] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 79.15 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:36:58] Hey all (liw thcipriani) - I wanted to do a sec-deploy for T234450 to wmf.2 and wmf.3. Any current issues where I shouldn't? 
[17:37:17] sbassett, yes [17:37:29] liw: Ok, train stuff? [17:37:46] sbassett: I don't believe wmf.3 made it anywhere just yet. Backporting to wmf.3 is probably sufficient in that case. [17:37:47] !log Updated Parsoid to cf01d91 (T234057, T234768, T235296, T235684, T235563) [17:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:56] T235563: Link prefix differences between Parsoid/JS & Parsoid/PHP - https://phabricator.wikimedia.org/T235563 [17:37:57] T234057: Get rid of hybrid testing code - https://phabricator.wikimedia.org/T234057 [17:37:57] T235296: Parsoid/PHP adds a data-sort-value="", lang="" whereas Parsoid/JS doesn't - https://phabricator.wikimedia.org/T235296 [17:37:57] T234768: Create Balinese Wikipedia - https://phabricator.wikimedia.org/T234768 [17:37:57] T235684: id and fallback id differences - https://phabricator.wikimedia.org/T235684 [17:38:20] sbassett: that is, it's not deployed so best not to scap sync-file or whatever for that branch, but patching on disk is fine [17:38:51] correct, wmf.3 didn't get even to testwiki today [17:39:15] thcipriani: Core patch - can I drop it in /srv/patches in the relevant wmf.2 and wmf.3 dirs and just deploy to wmf.2? Or should I just not do anything with wmf.3 right now? [17:41:00] sbassett thcipriani liw: my current thinking is that i will go ahead with train during the american deploy window @ 19:00 UTC. [17:41:38] sbassett: if you could apply the patch to wmf.3 so we don't have to do it later, that'd be good [17:41:44] ^ [17:43:07] (03PS1) 10Paladox: Revert "gerrit: enable jgit gc" [puppet] - 10https://gerrit.wikimedia.org/r/545351 (https://phabricator.wikimedia.org/T236114) [17:44:39] brennen thcipriani: ok, got thumbs up from _security for now. I'll plan to patch and deploy to wmf.2 and just patch for wmf.3. Sound good? 
[17:44:52] (03CR) 10Thcipriani: [C: 03+1] "Let's do this so we don't lose any data in T236114" [puppet] - 10https://gerrit.wikimedia.org/r/545351 (https://phabricator.wikimedia.org/T236114) (owner: 10Paladox) [17:45:14] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/18989/" [puppet] - 10https://gerrit.wikimedia.org/r/545342 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [17:45:19] sbassett: sounds good [17:46:50] (03CR) 10Paladox: [C: 03+1] gerrit: change gerrit master_host to gerrit1001, remove duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545342 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [17:50:50] (03CR) 10Dzahn: [C: 03+2] Revert "gerrit: enable jgit gc" [puppet] - 10https://gerrit.wikimedia.org/r/545351 (https://phabricator.wikimedia.org/T236114) (owner: 10Paladox) [17:51:46] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@b4c484a]: Build structured talk pages by walking the DOM (T235213) [17:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:50] T235213: Optimize talk endpoint performance - https://phabricator.wikimedia.org/T235213 [17:54:14] !log restarting gerrit to disable jgit gc (T236114) [17:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:17] T236114: check and fix some Gerrit revs - https://phabricator.wikimedia.org/T236114 [17:57:00] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@b4c484a]: Build structured talk pages by walking the DOM (T235213) (duration: 05m 14s) [17:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:04] T235213: Optimize talk endpoint performance - https://phabricator.wikimedia.org/T235213 [17:57:13] !Deployed security fix for T234450 to wmf.2 [17:57:25] crap [17:57:34] !log Deployed security fix for T234450 to wmf.2 [17:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:43] !log 
Uploaded and applied (but did not deploy per releng) security fix for T234450 to wmf.3 [17:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:04:49] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 82.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:06:28] (03CR) 10Dzahn: Parsoid/PHP: Load the extension on all Parsoid nodes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [18:06:50] (03PS2) 10Dzahn: Add Mon (mnw) language [dns] - 10https://gerrit.wikimedia.org/r/544325 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [18:07:26] (03CR) 10Dzahn: [C: 03+2] "approved by langcom - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Mon" [dns] - 10https://gerrit.wikimedia.org/r/544325 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [18:08:17] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [18:09:29] !log DNS - added new Wikipedia language "mnw" (Mon) T235739 - a language spoken in Myanmar [18:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:33] T235739: Create Mon Wikipedia - https://phabricator.wikimedia.org/T235739 [18:15:47] (03PS1) 10Paladox: Revert "Revert "gerrit: enable jgit gc"" [puppet] - 
10https://gerrit.wikimedia.org/r/545367 [18:16:02] (03CR) 10Paladox: [C: 04-1] "Needs Thcipriani say so (and +1)" [puppet] - 10https://gerrit.wikimedia.org/r/545367 (owner: 10Paladox) [18:24:51] (03CR) 10Subramanya Sastry: [C: 03+1] Parsoid/PHP: Load the extension on all Parsoid nodes (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [18:28:30] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [18:34:29] 10Operations, 10serviceops, 10Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) [18:34:36] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [18:37:17] (03PS1) 10Ppchelko: Varnish: don't decode/encode slashes for core REST API paths. [puppet] - 10https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) [18:48:20] (03CR) 10Dzahn: [C: 04-1] "per IRC, we should make a parameter instead to include the migration/rsync stuff or not" [puppet] - 10https://gerrit.wikimedia.org/r/545342 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:48:45] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:49:21] (03Abandoned) 10Dzahn: Revert "gerrit::migration: switch master to gerrit1001" [puppet] - 10https://gerrit.wikimedia.org/r/545084 (owner: 10Dzahn) [18:59:15] (03CR) 10Mobrovac: "A couple of minors, otherwise lgtm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) (owner: 10Ppchelko) [18:59:39] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18990/" [puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [19:00:05] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T1900). [19:00:45] 10Operations, 10ops-eqiad, 10DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10RobH) 05Open→03Resolved Please note this is now a checkbox on all PDU upgrade tasks, so I'm resolving this task. [19:00:47] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [19:03:10] (03PS2) 10Ppchelko: Varnish: don't decode/encode slashes for core REST API paths. [puppet] - 10https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) [19:03:33] !log proceeding with train for 1.35.0-wmf.3 [19:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:44] (03CR) 10jerkins-bot: [V: 04-1] Varnish: don't decode/encode slashes for core REST API paths. 
[puppet] - 10https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) (owner: 10Ppchelko) [19:05:00] (03PS3) 10Ppchelko: Varnish: don't decode/encode slashes for core REST API paths [puppet] - 10https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) [19:05:25] (03CR) 10Brennen Bearnes: [C: 03+2] Group0 to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545281 (owner: 10Lars Wirzenius) [19:05:42] (03CR) 10Ppchelko: Varnish: don't decode/encode slashes for core REST API paths (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) (owner: 10Ppchelko) [19:06:14] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545281 (owner: 10Lars Wirzenius) [19:06:32] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dzahn) ^ The reason to merge this was not a comment on the general question to enable avatars. The reason was that during T222391 we noticed an undesirable dependency. During a Ger... [19:07:51] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [19:09:44] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) This is mostly done and all boxes are checked. Though only really closing it after: T236114 is r... 
[19:10:58] (03PS2) 10Jhedden: openstack: patch python-designateclient header values [puppet] - 10https://gerrit.wikimedia.org/r/545072 (https://phabricator.wikimedia.org/T235863) [19:11:25] (03CR) 10Jhedden: openstack: patch python-designateclient header values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545072 (https://phabricator.wikimedia.org/T235863) (owner: 10Jhedden) [19:13:32] (03CR) 10Jhedden: [C: 03+2] openstack: patch python-designateclient header values [puppet] - 10https://gerrit.wikimedia.org/r/545072 (https://phabricator.wikimedia.org/T235863) (owner: 10Jhedden) [19:16:34] (03CR) 10Mobrovac: [C: 03+1] "Applied in beta already, works." [puppet] - 10https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) (owner: 10Ppchelko) [19:22:52] (03PS1) 10Dzahn: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) [19:23:25] (03CR) 10jerkins-bot: [V: 04-1] gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [19:24:07] (03CR) 10Bstorm: host monitoring: add optional contact group for mgmt interfaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [19:25:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:27:00] (03CR) 10Paladox: [C: 03+1] "Apart from jenkins error, +1" [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [19:27:22] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.3 [19:27:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:39] PROBLEM - SSH druid1004.mgmt on druid1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:30] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Cmjohnson) [19:40:22] (03PS2) 10Dzahn: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) [19:41:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:42:48] !log gerrit1001: apt install colordiff # T236114 [19:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:53] T236114: check and fix some Gerrit revs - https://phabricator.wikimedia.org/T236114 [19:43:18] (03PS3) 10Dzahn: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) [19:43:53] (03CR) 10jerkins-bot: [V: 04-1] gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [19:44:51] (03PS4) 10Paladox: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [19:45:25] (03CR) 10jerkins-bot: [V: 04-1] gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [19:45:33] (03PS1) 10Hashar: gerrit: add colordiff package [puppet] - 10https://gerrit.wikimedia.org/r/545384 (https://phabricator.wikimedia.org/T236114) 
[19:46:07] (03PS5) 10Paladox: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [19:46:41] (03CR) 10Dzahn: [C: 03+2] gerrit: add colordiff package [puppet] - 10https://gerrit.wikimedia.org/r/545384 (https://phabricator.wikimedia.org/T236114) (owner: 10Hashar) [19:46:51] (03CR) 10Bstorm: [C: 03+2] host monitoring: add optional contact group for mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [19:51:00] mutante: thanks :) [19:51:26] (03CR) 10BPirkle: rename service definition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [19:51:26] yw hashar, thanks for hard work on fixes [19:54:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) irc update with john: These are going to take WEEKS to wipe, and are all old hdd. Rather than tie up that much onsite time swapping... 
[19:56:36] (03PS1) 10BBlack: geodns: eqiad non-primary for all public users [dns] - 10https://gerrit.wikimedia.org/r/545385 (https://phabricator.wikimedia.org/T235805) [20:01:03] (03CR) 10CRusnov: [C: 03+2] coherence: Check unracked devices for connected console ports [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/545132 (owner: 10CRusnov) [20:03:25] (03PS1) 10Bstorm: monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458) [20:05:21] 10Operations, 10Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) Of interest: all have user agent FortiGate (FortiOS 5.0) and [[ https://logstash.wikimedia.org/goto/3fa7d259cc2043eb0b56a6ae5e89298f | have appeared near simultaneously from a number of sources gl... [20:05:41] 10Operations, 10Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) p:05Triage→03Normal [20:07:10] 10Operations, 10MediaWiki-Maintenance-scripts, 10cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (10colewhite) p:05Triage→03Normal [20:09:01] 10Operations, 10Gerrit: Editing in Gerrit isn't saved after the update/migration to gerrit1001 - https://phabricator.wikimedia.org/T236143 (10colewhite) p:05Triage→03Normal [20:09:34] !log gerrit1001 - mkdir /srv/gerrit/cobalt/git - rsyncing /srv/gerrit/git from cobalt to /srv/gerrit/cobalt/git/ on gerrit1001 (T236114) [20:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:39] T236114: check and fix some Gerrit revs - https://phabricator.wikimedia.org/T236114 [20:09:55] 10Operations, 10Traffic: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (10BBlack) p:05Triage→03Normal [20:18:46] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: 
WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10EdErhart-WMF) [20:22:17] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10Nuria) Is CherRaye Glenn a contractor? If so when does the contract expire? [20:26:02] (03PS1) 10Dzahn: admins: add shell account for Lex Nasser [puppet] - 10https://gerrit.wikimedia.org/r/545388 (https://phabricator.wikimedia.org/T235688) [20:29:52] (03PS2) 10Dzahn: admins: add shell account for Lex Nasser [puppet] - 10https://gerrit.wikimedia.org/r/545388 (https://phabricator.wikimedia.org/T235688) [20:31:50] RECOVERY - SSH druid1004.mgmt on druid1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:49:18] (03CR) 10Mathew.onipe: query_service: prepare query_service for reusbility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [20:51:15] (03PS2) 10Bstorm: wiki replicas: Add the labsdb1012 replica to maintain_dbusers [puppet] - 10https://gerrit.wikimedia.org/r/543924 (https://phabricator.wikimedia.org/T235791) [20:58:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:59:08] (03PS2) 10Bstorm: monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458) [21:00:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:03:12] (03CR) 10Catrope: [C: 03+2] Set GrowthExperiments task suggester config on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545155 (https://phabricator.wikimedia.org/T234426) (owner: 10Gergő Tisza) [21:03:57] (03Merged) 10jenkins-bot: Set GrowthExperiments task suggester config on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545155 (https://phabricator.wikimedia.org/T234426) (owner: 10Gergő Tisza) [21:04:18] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Add the labsdb1012 replica to maintain_dbusers [puppet] - 10https://gerrit.wikimedia.org/r/543924 (https://phabricator.wikimedia.org/T235791) (owner: 10Bstorm) [21:06:18] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10Varnent) >>! In T236209#5596762, @Nuria wrote: > Is CherRaye Glenn a contractor? If so when does the contract expire? Ch... [21:06:52] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10wikimediafoundation.org: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10Nuria) [21:08:45] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10wikimediafoundation.org: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10Nuria) Then she should be added to wmf LDAP group after @Heather's approval ping @Dzahn which I... 
[21:25:31] (03PS19) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [21:25:33] (03PS26) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [21:25:35] (03PS23) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [21:25:37] (03PS18) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [21:25:39] (03PS18) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [21:25:41] (03PS18) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [21:27:14] (03PS3) 10Dzahn: admins: add shell account for Lex Nasser [puppet] - 10https://gerrit.wikimedia.org/r/545388 (https://phabricator.wikimedia.org/T235688) [21:29:28] (03CR) 10jerkins-bot: [V: 04-1] query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [21:34:09] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) a:03RobH [21:34:17] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [21:34:20] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10MusikAnimal) There were several bursts of 503s over the past few weeks, the last was six days ago. 
But overall, yes, things have improved. I do realize 503s are super generic, it was just the frequency that... [21:35:49] (03CR) 10Papaul: [C: 03+1] admins: add shell account for Lex Nasser [puppet] - 10https://gerrit.wikimedia.org/r/545388 (https://phabricator.wikimedia.org/T235688) (owner: 10Dzahn) [21:36:59] (03CR) 10Dzahn: [C: 03+2] admins: add shell account for Lex Nasser [puppet] - 10https://gerrit.wikimedia.org/r/545388 (https://phabricator.wikimedia.org/T235688) (owner: 10Dzahn) [21:39:58] (03PS19) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [21:40:01] (03PS19) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [21:40:03] (03PS19) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [21:40:45] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) @lexnasser Within max. 30 minutes this should work for you now. Please take a look at https://wikitech.wikimedia.org/wiki/Production_access... [21:41:00] (03CR) 10Mathew.onipe: "PCC result is good" [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [21:41:09] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) 05Open→03Resolved If any unexpected issues please just reopen the ticket. 
[21:41:33] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) +1 , also let's make sure to go over the Data guidelines before working with the data. [21:45:21] !log LDAP - added lexnasser to nda group (T235688) [21:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:25] T235688: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 [21:46:29] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) >>! In T235688#5587345, @Nuria wrote: > And also we need to add lex to nda group for access to turnilo and superset Done! @lexnasser You... [21:48:04] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10sbassett) 05Open→03Resolved a:03MusikAnimal >>! In T233271#5596898, @MusikAnimal wrote: > But overall, yes, things have improved. I do realize 503s are super generic, it was just the frequency that rai... 
[21:49:15] (03PS24) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [21:49:17] (03PS20) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [21:49:19] (03PS20) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [21:49:21] (03PS20) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [21:49:27] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10wikimediafoundation.org: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10Dzahn) Actually that is @colewhite this week but we are on it. [21:50:12] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [21:51:59] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [21:52:36] jouncebot: now [21:52:37] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [21:52:41] oh good. 
[21:53:54] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [21:56:13] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10Dzahn) a:05RStallman-legalteam→03colewhite [21:56:38] (03CR) 10Mathew.onipe: "> Patch Set 26:" [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [21:57:34] !log stopping gerrit to run ref-update script [21:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:46] !log stopping gerrit to run ref-update script T236114 [21:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:50] T236114: check and fix some Gerrit revs - https://phabricator.wikimedia.org/T236114 [21:59:00] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10wikimediafoundation.org: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10Heather) Approved. Thanks, everyone! [21:59:11] thcipriani: you stopped gerrit! arggggh :) [21:59:33] oh whew [21:59:35] i was like wtfffff [21:59:50] 'how did i bork my git now, it was working a second ago' [21:59:55] :) [21:59:59] sorry folks [22:00:10] should be back [22:00:13] no worries [22:00:18] im just happy it wasnt me. 
[22:00:31] no p ;) [22:00:32] (03PS1) 10RobH: adding new pdus to esams mgmt [dns] - 10https://gerrit.wikimedia.org/r/545406 [22:01:45] (03CR) 10RobH: [C: 03+2] adding new pdus to esams mgmt [dns] - 10https://gerrit.wikimedia.org/r/545406 (owner: 10RobH) [22:02:28] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/19001/" [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [22:03:22] hrmm [22:03:39] linking into tasks automatically via patchset doesnt seem to be happening (or is doing so slowly) [22:03:52] ie: my new dns patch shows the bug in gerrit, but didnt update on the phab task [22:04:33] (03PS1) 10Cwhite: admin: add Nikki to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/545407 (https://phabricator.wikimedia.org/T235136) [22:05:54] robh: hmmm... does wikibugs make those links? It might not have liked the restart of gerrit if so. [22:06:00] * bd808 goes to figure that out [22:06:55] nope. that's done by https://wikitech.wikimedia.org/wiki/Gerrit_Notification_Bot which is a gerrit plugin apparently [22:07:16] robh: missing : [22:07:25] Bug: T... 
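The exchange above comes down to a commit-message footer that was written without its colon, so the patch was not linked to its Phabricator task. A minimal sketch of the kind of check that would flag this, assuming the footer is expected on its own line as `Bug: T<digits>`. The function name, the return labels, and the task number in the usage note are hypothetical; this is not the Gerrit plugin's or any CI job's actual logic.

```python
import re

# Assumption: a well-formed footer is "Bug: T<digits>" alone on a line.
BUG_FOOTER = re.compile(r"^Bug: T\d+$", re.MULTILINE)

# Near misses such as "Bug T123" (no colon), "Bug:T123" or "bug: T123".
NEAR_MISS = re.compile(r"^Bug:? ?T\d+$", re.MULTILINE | re.IGNORECASE)


def check_bug_footer(message: str) -> str:
    """Classify a commit message as 'ok', 'malformed' or 'missing'."""
    if BUG_FOOTER.search(message):
        return "ok"
    if NEAR_MISS.search(message):
        # Looks like an attempted footer that will probably not be linked.
        return "malformed"
    return "missing"
```

For example, `check_bug_footer("adding new pdus\n\nBug T123456")` returns `"malformed"`, while the same message with `Bug: T123456` returns `"ok"` (T123456 is a placeholder, not a real task from this log).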
[22:08:07] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10colewhite) [22:08:16] (03CR) 10Dzahn: [C: 03+1] admin: add Nikki to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/545407 (https://phabricator.wikimedia.org/T235136) (owner: 10Cwhite) [22:15:49] (03CR) 10Cwhite: [C: 03+2] admin: add Nikki to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/545407 (https://phabricator.wikimedia.org/T235136) (owner: 10Cwhite) [22:21:29] (03PS2) 10Dzahn: DNS: Remove production and mgmt DNS for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/544279 (https://phabricator.wikimedia.org/T222109) (owner: 10Papaul) [22:21:37] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/8; [edit interfaces interface-range disabled] mem... [22:21:42] (03PS1) 10Cwhite: admin: add cohi to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/545409 (https://phabricator.wikimedia.org/T234429) [22:22:29] 10Operations, 10SRE-Access-Requests, 10WMF-Legal, 10Patch-For-Review: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10colewhite) [22:22:31] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Papaul) [22:23:26] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10colewhite) Hi Nikki! I've deployed the necessary changes and added you to the wmf group. Please let me know if you encounter any related issue. 
[22:23:36] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10colewhite) 05Open→03Resolved [22:23:40] (03CR) 10Dzahn: "production DNS entries are not removed yet it looks" [dns] - 10https://gerrit.wikimedia.org/r/544279 (https://phabricator.wikimedia.org/T222109) (owner: 10Papaul) [22:25:13] (03PS2) 10Cwhite: admin: add cohi to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/545409 (https://phabricator.wikimedia.org/T234429) [22:25:36] ha [22:25:38] damn [22:27:13] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10wikimediafoundation.org: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10colewhite) a:03colewhite [22:27:14] id have thought the ci would have failed it for that [22:27:21] rather than allow and plugin fail [22:33:29] (03PS1) 10Cwhite: admin: add keepit-ssh (CherRaye Glenn) to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/545410 (https://phabricator.wikimedia.org/T236209) [22:34:12] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10RobH) p:05Triage→03Normal [22:34:20] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10SRE-Access-Requests, and 2 others: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10colewhite) p:05Triage→03Normal [22:34:24] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10RobH) [22:36:41] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH) p:05Triage→03Normal [22:37:05] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH) [22:38:11] (03PS7) 10Jforrester: Variant configuration: Allow for 
YAML-based inheritance of configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) [22:38:21] robh: there is a commit message test that can be added to any gerrit repo, but there are not many repos that have opted-in to using it. [22:38:26] (03PS20) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [22:38:28] (03PS1) 10Jforrester: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) [22:39:34] (03CR) 10Cwhite: [C: 03+1] logstash: remove deprecated elasticsearch options [puppet] - 10https://gerrit.wikimedia.org/r/545236 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [22:39:38] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [22:39:59] (03Abandoned) 10Jforrester: Variant configuration: Move some dblist configuration into YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539414 (owner: 10Jforrester) [22:43:23] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10thcipriani) [22:47:04] (03PS21) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [22:47:06] (03PS21) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [22:47:08] (03PS21) 10Mathew.onipe: 
query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [22:49:50] (03CR) 10jerkins-bot: [V: 04-1] query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [22:50:30] (03CR) 10jerkins-bot: [V: 04-1] query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [22:58:05] (03PS18) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191022T2300). [23:00:04] andrewbogott and Dbarratt: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:12] here! [23:00:17] me too! [23:00:41] (03PS22) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [23:00:43] (03PS22) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [23:00:45] (03PS22) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [23:01:55] do we have a deployer? 
[23:03:33] (03CR) 10jerkins-bot: [V: 04-1] query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [23:05:26] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1001/19004/" [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [23:06:11] (03PS1) 10Paladox: Update scap targets [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/545416 [23:06:23] (03PS2) 10Paladox: Update scap targets [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/545416 [23:06:36] ping MaxSem, RoanKattouw, Niharika, and Urbanecm [23:08:16] (03PS23) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [23:08:17] (03PS23) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [23:14:14] andrewbogott I guess not. :( [23:14:30] MaxSem, RoanKattouw, Niharika, Urbanecm, I'm going to step away but please ping me here if one of you appears and I'll rush back to my keyboard. [23:14:57] (03CR) 10Dzahn: [C: 03+1] DNS: Remove production and mgmt DNS for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/544279 (https://phabricator.wikimedia.org/T222109) (owner: 10Papaul) [23:15:02] PROBLEM - Host ps1-a6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [23:16:32] Is anyone doing SWAT? 
[23:17:27] mooeypoo: seems not [23:17:39] (03CR) 10Dzahn: [C: 03+1] admin: add cohi to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/545409 (https://phabricator.wikimedia.org/T234429) (owner: 10Cwhite) [23:19:32] (03CR) 10Dzahn: [C: 03+1] "looks good, but pending approval by Heather" [puppet] - 10https://gerrit.wikimedia.org/r/545410 (https://phabricator.wikimedia.org/T236209) (owner: 10Cwhite) [23:20:16] pretty please @RoanKattouw / @MaxSem ...? either of you available for SWAT ? [23:20:58] Sorry, was in a meeting [23:21:08] (03CR) 10Dzahn: [C: 03+1] "approval is there, ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/545410 (https://phabricator.wikimedia.org/T236209) (owner: 10Cwhite) [23:21:09] (with... me.... :D ) [23:21:33] andrewbogott & davidwbarratt, yt? [23:21:39] MaxSem yep! [23:22:03] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 (10colewhite) I've been monitoring this the past couple days. Since yesterday we've gone from over 20k messages in the queue to less than 6k. The backlog s... [23:22:45] MaxSem: I'm here! 
[23:23:20] (03CR) 10Cwhite: [C: 03+2] admin: add cohi to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/545409 (https://phabricator.wikimedia.org/T234429) (owner: 10Cwhite) [23:23:28] (03PS3) 10Cwhite: admin: add cohi to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/545409 (https://phabricator.wikimedia.org/T234429) [23:24:35] (03PS7) 10Andrew Bogott: labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) [23:25:46] (03CR) 10MaxSem: [C: 03+2] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [23:26:05] Now I need to figure out how to deploy that change... [23:26:30] (03Merged) 10jenkins-bot: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [23:26:48] I'm not sure how to selectively deploy a rename [23:27:04] maybe just do the after and then the before for cleanup [23:28:05] MaxSem: for andrewbogott's changes, the main thing is to make sure they don't break "real" wikis. So I think staging on mwdebugXXXX and testing there that say enwiki + mw.o work should be sufficient. We can deal with actually fully testing testlabswiki separately later. 
[23:28:38] yeah, we definitely don't need to worry about breaking testlabswiki, I'm pretty much the only one who ever looks at it [23:29:56] (03CR) 10Dzahn: [C: 03+2] admin: add keepit-ssh (CherRaye Glenn) to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/545410 (https://phabricator.wikimedia.org/T236209) (owner: 10Cwhite) [23:30:16] (03PS2) 10Dzahn: admin: add keepit-ssh (CherRaye Glenn) to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/545410 (https://phabricator.wikimedia.org/T236209) (owner: 10Cwhite) [23:30:19] andrewbogott: pulled on mwdebug1002, please test [23:31:06] ok! Um… what url do I use to hit that host? [23:31:31] bd808: ^ ? [23:31:31] Use the Wikimedia-Debug browser extension [23:31:49] MaxSem: I just tested mw.o reads and edits on mwdebug1002 and they look good. [23:31:58] thank you :) [23:32:52] !log LDAP - added keepit-ssh to wmf group (T236209) [23:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:57] T236209: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 [23:33:05] MaxSem: enwiki too. So if you didn't see any spurt of soft errors on the backend should be good to go [23:33:47] andrewbogott: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug is the magic to be able to test things on the mwdebugXXXX hosts [23:34:58] * andrewbogott installs the extension for future reference [23:36:16] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10SRE-Access-Requests, and 2 others: WikimediaFoundation.org analytics access for CherRaye Glenn - https://phabricator.wikimedia.org/T236209 (10Dzahn) 05Open→03Resolved done. 
she has been added to the "wmf" group [23:37:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, and 2 others: Analytics Access for Grant (groups cn=wmf and analytics-privatedata-users) - https://phabricator.wikimedia.org/T235260 (10Dzahn) [23:37:40] MaxSem: looks good to me too [23:38:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, and 2 others: Analytics Access for Grant (groups cn=wmf and analytics-privatedata-users) - https://phabricator.wikimedia.org/T235260 (10Dzahn) a:05herron→03colewhite L3 has been signed. This is unblocked. [23:38:45] !log maxsem@deploy1001 Synchronized dblists/labtestwiki.dblist: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/543664/ (duration: 01m 02s) [23:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:30] Okay, now we either explode or win [23:41:37] !log maxsem@deploy1001 Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/543664/ (duration: 01m 01s) [23:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:52] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Dzahn) [23:43:11] !log maxsem@deploy1001 Synchronized dblists/: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/543664/ (duration: 00m 59s) [23:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:17] (03CR) 10MaxSem: [C: 03+2] labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott) [23:46:58] (03Merged) 10jenkins-bot: labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew 
Bogott) [23:48:24] andrewbogott and davidwbarratt, your changes are staged on mwdebug1002 [23:48:34] MaxSem thanks! [23:49:49] In my case there's not much to test since the second patch only affects wikitech-style wikis. [23:49:57] (which renders it fairly harmless as well) [23:50:10] davidwbarratt nothing I can really test, but it doesn't appear to have broken anything. :) [23:50:23] MaxSem ^ [23:53:02] !log maxsem@deploy1001 Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/543943/ (duration: 01m 01s) [23:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:28] andrewbogott: please test ^ [23:55:04] MaxSem: lgtm. Edited and logged out/in [23:55:53] that's everything, right? [23:57:06] !log maxsem@deploy1001 Synchronized php-1.35.0-wmf.3/includes/block/DatabaseBlock.php: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/545373/ (duration: 00m 59s) [23:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:24] davidwbarratt: please test ^ [23:57:31] kk [23:57:40] * andrewbogott -> the kitchen [23:57:46] Thank you MaxSem! [23:58:10] MaxSem nothing appears to be broken! [23:58:25] MaxSem thanks! [23:58:51] (03PS1) 10Cwhite: admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209)