[00:12:14] Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH)
[00:28:51] Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH)
[01:01:31] (PS1) Cwhite: Parameterize path so as to better integrate with Prometheus service discovery. Parameterize spec_segment. Maintain backwards compatibility. Improvements preparing and sanitizing the url for sending to CheckService. [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/541683
[01:32:29] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28508 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[01:39:02] Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH) @fgiunchedi: the task description has been updated * ms-be105[1236] are all yours * i need to finish installation/setup of ms-be105[45]
[01:40:51] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[01:44:31] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[01:48:51] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[01:50:59] (PS1)
Mathew.onipe: wdqs: cleanup un-useful nginx config [puppet] - https://gerrit.wikimedia.org/r/541684
[02:05:33] (PS2) Mathew.onipe: wdqs: cleanup un-useful nginx config [puppet] - https://gerrit.wikimedia.org/r/541684
[02:05:39] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:07:09] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:16:41] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 24160 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[02:28:25] (PS4) CRusnov: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[02:30:52] (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[02:32:45] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[02:38:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 342869896 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:40:31] RECOVERY - Postgres Replication Lag on maps2003 is OK:
POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3872 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:57:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:57:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:59:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:59:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:45:12] (PS3) Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588)
[03:45:32] (CR) Mathew.onipe: wdqs: add data-reload cookbook (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[03:53:36] (CR) Mathew.onipe: wdqs: add data-reload cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[03:58:52] Operations, ops-codfw, Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Vgutierrez) >>! In T196560#4927737, @Vgutierrez wrote: > So, the NIC issue reported in T203194 seems to be fixed after upgrading the NIC firmware to version 21.40 (https://www.dell.com/support...
[03:59:53] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (wiki_willy) a:Cmjohnson→Jclark-ctr Hi @Jclark-ctr - can you hit up @Vgutierrez when you get in during the AM sometime this week to depool the host? You guys have overlap in the mornings, un...
[04:20:21] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:06] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541687 (https://phabricator.wikimedia.org/T231433)
[04:36:08] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541688 (https://phabricator.wikimedia.org/T231433)
[04:39:59] !log switching cp5004 from nginx to ats-tls - T231433
[04:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:05] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:40:44] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541687 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:47:21] PROBLEM - HTTPS Unified ECDSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:47:40] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541688 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:47:45] PROBLEM - HTTPS Unified RSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:47:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3312', diff saved to https://phabricator.wikimedia.org/P9272 and previous config saved to /var/cache/conftool/dbconfig/20191009-044752-marostegui.json
[04:47:53] ^^ expected
[04:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:45] PROBLEM - Ensure traffic_server is running for instance tls on cp5004 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M
--run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:53:47] PROBLEM - Check systemd state on cp5004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:01] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5004 is CRITICAL: connect to address 10.132.0.104 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:54:29] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp5004 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:55:21] RECOVERY - Ensure traffic_server is running for instance tls on cp5004 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:55:29] RECOVERY - HTTPS Unified ECDSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345579 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:55:53] RECOVERY - HTTPS Unified RSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345556 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:56:07] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp5004 is OK: PROCS OK: 1 process with args /usr/bin/python3
/usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:57:01] RECOVERY - Check systemd state on cp5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:15] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5004 is OK: HTTP OK: HTTP/1.0 200 OK - 19533 bytes in 0.713 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:58:29] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp5004 is CRITICAL: connect to address 10.132.0.104 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:59:33] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:59:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312 for schema change', diff saved to https://phabricator.wikimedia.org/P9273 and previous config saved to /var/cache/conftool/dbconfig/20191009-045941-marostegui.json
[04:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:51] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:01:39] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[05:02:01] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[05:02:34] hmmm gerrit is having some issues
[05:04:06] vgutierrez: yeah.
this got logged in the releng channel -- "[05:01] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring"
[05:04:16] oh, here too :)
[05:04:21] indeed :)
[05:04:39] it's a long gc pause if that's the problem
[05:07:15] sigh.. icinga puppet run is messed up with gerrit being down...
[05:11:07] !log Restart gerrit as it is down
[05:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:43] PROBLEM - Check the last execution of git_pull_charts on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:14:29] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 25575 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[05:14:34] :D
[05:14:49] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.057 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[05:17:01] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 344288 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:17:35] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 344252 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:17:38] nice...
puppet is able to run again on icinga :D
[05:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 for schema change - lag will be generated on s6 labs', diff saved to https://phabricator.wikimedia.org/P9274 and previous config saved to /var/cache/conftool/dbconfig/20191009-051911-marostegui.json
[05:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:51] RECOVERY - Check the last execution of git_pull_charts on deploy1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:24:44] Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): Login on wikitech wiki fails after OpenStack upgrade removed v2 identity API - https://phabricator.wikimedia.org/T234996 (bd808) >>! In T234996#5557713, @gerritbot wrote: > Change 541644 had a related patch set up...
[05:25:44] Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): Login on wikitech wiki fails after OpenStack upgrade removed v2 identity API - https://phabricator.wikimedia.org/T234996 (bd808) >>! In T234996#5557995, @bd808 wrote: > I set a custom message at https://wikitech.w...
[05:41:22] (PS1) Marostegui: db-eqiad.php: Depool es1013, es1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/541707 (https://phabricator.wikimedia.org/T227536)
[05:42:57] (CR) Marostegui: [C: +2] db-eqiad.php: Depool es1013, es1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/541707 (https://phabricator.wikimedia.org/T227536) (owner: Marostegui)
[05:43:40] (Merged) jenkins-bot: db-eqiad.php: Depool es1013, es1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/541707 (https://phabricator.wikimedia.org/T227536) (owner: Marostegui)
[05:45:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1013, es1014 T227536 (duration: 01m 00s)
[05:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:14] T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536
[05:45:24] (PS5) CRusnov: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[05:46:15] (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[05:46:57] (PS1) Marostegui: site.pp: Remove puppet references for db2069 [puppet] - https://gerrit.wikimedia.org/r/541709 (https://phabricator.wikimedia.org/T230107)
[05:47:25] (PS1) Marostegui: wmnet: Remove db2069 production DNS entries [dns] - https://gerrit.wikimedia.org/r/541710 (https://phabricator.wikimedia.org/T230107)
[05:47:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:47:40] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:44] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2069.codfw.wmnet` - db2069.codfw.wmnet (**PASS**)...
[05:47:52] (CR) Marostegui: [C: +2] site.pp: Remove puppet references for db2069 [puppet] - https://gerrit.wikimedia.org/r/541709 (https://phabricator.wikimedia.org/T230107) (owner: Marostegui)
[05:48:22] (CR) Marostegui: [C: +2] wmnet: Remove db2069 production DNS entries [dns] - https://gerrit.wikimedia.org/r/541710 (https://phabricator.wikimedia.org/T230107) (owner: Marostegui)
[05:48:45] (PS3) Marostegui: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300)
[05:48:52] (PS2) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300)
[05:49:45] Operations, ops-codfw, DC-Ops, decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (Marostegui) a:RobH→Papaul
[05:49:58] Operations, ops-codfw, DC-Ops, decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (Marostegui) Host ready for on-site steps + switch disablement
[06:04:48] Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[06:09:45] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541711 (https://phabricator.wikimedia.org/T231433)
[06:09:47] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541712 (https://phabricator.wikimedia.org/T231433)
[06:31:38] RECOVERY - Check
systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:42] !log switching from nginx to ats-tls on cp4024 - T231433
[06:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:46] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[06:35:26] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541711 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[06:38:15] vgutierrez: nice!
[06:38:40] yup...
[06:38:56] half way through more or less right?
[06:38:56] upload is pretty happy with ats-tls
[06:39:02] yep
[06:39:10] I need to debug some tiny issues on text though
[06:42:44] PROBLEM - HTTPS Unified ECDSA on cp4024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[06:43:46] PROBLEM - HTTPS Unified RSA on cp4024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[06:46:00] (CR) Muehlenhoff: [C: +1] "Thanks, merging" [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:46:07] (PS2) Muehlenhoff: cumin: remove yubiauth alias [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:47:05] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541712 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[06:48:28] Operations, ops-eqiad, User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (elukey) >>! In T227025#5556705, @Cmjohnson wrote: > I don't know what you need me to do...the servers were setup correctly. There seems to be an issue with...
[06:48:34] (PS1) Alexandros Kosiaris: Fix sessionstore lvs monitoring typo [puppet] - https://gerrit.wikimedia.org/r/541714
[06:49:35] (CR) Giuseppe Lavagetto: [C: +2] Fix sessionstore lvs monitoring typo [puppet] - https://gerrit.wikimedia.org/r/541714 (owner: Alexandros Kosiaris)
[06:50:12] RECOVERY - HTTPS Unified RSA on cp4024 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345589 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:50:44] RECOVERY - HTTPS Unified ECDSA on cp4024 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345555 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:53:02] (CR) Muehlenhoff: [C: +1] "Looks good, you could also simply use php-foo packages from php-defaults, which pulls in the correct phpX.Y-foo packages. But the cu" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[06:53:54] (PS3) Muehlenhoff: cumin: remove yubiauth alias [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:57:28] (CR) Muehlenhoff: [C: +2] cumin: remove yubiauth alias [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:58:30] (CR) Giuseppe Lavagetto: [C: -1] "You don't want to add the servers to the "appserver" cluster. You are adding a new "service" to the already existing "parsoid" cluster."
[puppet] - https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[07:07:51] (CR) Elukey: "> I think if you just include the druid user in profile::hadoop::master::hadoop_user_groups," [puppet] - https://gerrit.wikimedia.org/r/541554 (owner: Elukey)
[07:09:37] (CR) Jcrespo: "I would suggest to pause this deploy until the gtid filtering issue gets researched." [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[07:10:12] (CR) Marostegui: [C: +1] "> I would suggest to pause this deploy until the gtid filtering issue" [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[07:14:44] (PS4) Elukey: profile::analytics::cluster::users: ensure user druid [puppet] - https://gerrit.wikimedia.org/r/541554
[07:18:03] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/18804/" [puppet] - https://gerrit.wikimedia.org/r/541554 (owner: Elukey)
[07:19:20] (CR) Alexandros Kosiaris: "> Add .pipeline/config.yaml with publish stage:" [deployment-charts] - https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: Jeena Huneidi)
[07:26:25] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:29:15] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:29:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:53] Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[07:36:55] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on
netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:37:36] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541752 (https://phabricator.wikimedia.org/T231433)
[07:37:38] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541753 (https://phabricator.wikimedia.org/T231433)
[07:38:55] !log Switch cp3038 from nginx to ats-tls - T231433
[07:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:59] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[07:40:10] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541752 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[07:45:31] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541753 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[07:46:49] PROBLEM - HTTPS Unified ECDSA on cp3038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[07:47:02] ^ expected?
:p
[07:47:07] PROBLEM - HTTPS Unified RSA on cp3038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[07:48:02] !log reduced RAM assignment for boron to 8G
[07:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:25] RECOVERY - HTTPS Unified ECDSA on cp3038 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345571 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[07:48:43] RECOVERY - HTTPS Unified RSA on cp3038 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345553 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[07:56:33] Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[07:57:08] (CR) Elukey: "Ok sorry I wasn't caffeinated enough :)" [puppet] - https://gerrit.wikimedia.org/r/541554 (owner: Elukey)
[07:59:59] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541758 (https://phabricator.wikimedia.org/T231433)
[08:00:01] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433)
[08:00:44] (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433)
[08:01:52] !log Switch cp2011 from nginx to ats-tls - T231433
[08:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:57] T231433: Move cache upload cluster from nginx
to ats-tls - https://phabricator.wikimedia.org/T231433
[08:04:58] (PS2) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541758 (https://phabricator.wikimedia.org/T231433)
[08:05:00] (PS3) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433)
[08:06:15] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541758 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[08:09:09] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[08:09:33] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:10:47] PROBLEM - HTTPS Unified ECDSA on cp2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[08:12:17] RECOVERY - HTTPS Unified ECDSA on cp2011 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345563 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 43 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:14:06] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:14:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:40] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:18:41] !log jmm@cumin2001 END
(PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:50] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:18:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:13] Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[08:24:29] !log draining ganeti1006 for upcoming reboot (combined kernel/qemu security updates)
[08:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:01] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:33:15] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:33:33] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:34:33] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:34:37] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:37:18] mmmmm
[08:37:31] I think there is a maint
[08:37:36] I am trying to decrypt the calendar [08:37:57] there is a Telia notification that I can see [08:38:03] checking on the routers [08:38:31] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:39:23] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1082 [puppet] - 10https://gerrit.wikimedia.org/r/541762 (https://phabricator.wikimedia.org/T231433) [08:39:26] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp1082 [puppet] - 10https://gerrit.wikimedia.org/r/541763 (https://phabricator.wikimedia.org/T231433) [08:39:40] !log Switch cp1082 from nginx to ats-tls - T231433 [08:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:45] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [08:40:14] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp1082 [puppet] - 10https://gerrit.wikimedia.org/r/541762 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:42:36] I think that there is a GRE tunnel down due to a transit maintenance, or similar [08:43:33] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp1082 [puppet] - 10https://gerrit.wikimedia.org/r/541763 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:44:04] NTT between eqdfw and ulsfo [08:44:45] that I guess we have also between ulsfo and eqsin? 
[08:45:25] yep [08:46:31] ahhh ok I can see in maint announce the NTT scheduled maintenance [08:46:48] but it is not in the gcal afaics [08:47:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1085 after schema change', diff saved to https://phabricator.wikimedia.org/P9275 and previous config saved to /var/cache/conftool/dbconfig/20191009-084732-marostegui.json [08:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:00] nothing on fire in theory [08:49:11] does what I wrote above make sense? [08:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3316 for schema change, temporarily pool db1085 as vslow,dump', diff saved to https://phabricator.wikimedia.org/P9276 and previous config saved to /var/cache/conftool/dbconfig/20191009-085016-marostegui.json [08:50:19] (the NTT maintenance is affecting ulsfo so both "legs" between eqdfw and eqsin are suffering, causing the OSPF alarms) [08:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:43] (legs == GRE tunnels) [08:51:10] elukey: which links are having maintenances? [08:51:19] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [08:51:38] XioNoX: I've read only NTT in ulsfo [08:52:37] they are saying a sw upgrade [08:53:05] but if transit is down in ulsfo then both GRE tunnels are down no? [08:53:57] elukey@cr4-ulsfo> show interfaces descriptions | match down [08:53:57] et-0/0/2 down down DISABLED [08:53:58] xe-0/1/0 up down Transit: NTT (service ID 234631) {#1079} [10Gbps] [08:54:12] XioNoX: --^ [08:55:33] on my phone and signal is terrible in CDG... 
[08:59:05] elukey: your diagnosis makes sense to me [08:59:40] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [08:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:52] XioNoX: I can open a ticket to NTT, they said no impact expected :D [09:00:31] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:49] 10Operations, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `auth1001.eqiad.wmnet` - auth1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Downtimed managemen... [09:01:01] 10Operations, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10MoritzMuehlenhoff) [09:01:25] 10Operations, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10MoritzMuehlenhoff) [09:03:06] elukey: not strictly needed as long as we're in the window [09:04:19] (03PS1) 10Muehlenhoff: auth1001: Remove remaining puppet references [puppet] - 10https://gerrit.wikimedia.org/r/541764 (https://phabricator.wikimedia.org/T234909) [09:05:17] XioNoX: yes we are, it will last 4h [09:07:01] (03CR) 10Muehlenhoff: [C: 03+2] auth1001: Remove remaining puppet references [puppet] - 10https://gerrit.wikimedia.org/r/541764 (https://phabricator.wikimedia.org/T234909) (owner: 10Muehlenhoff) [09:07:50] ok will wait then [09:09:44] 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10Volans) 05Open→03Resolved I'm marking this as resolved as the cookbook has been used many times at this point and both Phabricator templated and wikitech documentation have been updated acco... 
[09:09:53] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [09:13:20] (03CR) 10Volans: "> Patch Set 1:" [homer/public] - 10https://gerrit.wikimedia.org/r/541375 (owner: 10Ayounsi) [09:13:40] (03PS1) 10Muehlenhoff: Remove DNS entries for auth1001 [dns] - 10https://gerrit.wikimedia.org/r/541766 (https://phabricator.wikimedia.org/T234909) [09:15:50] (03CR) 10Jbond: [C: 03+2] puppet::rsync: disable chroot on volatile and ssl rsync [puppet] - 10https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [09:15:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for auth1001 [dns] - 10https://gerrit.wikimedia.org/r/541766 (https://phabricator.wikimedia.org/T234909) (owner: 10Muehlenhoff) [09:16:16] (03PS4) 10Jbond: puppet::rsync: disable chroot on volatile and ssl rsync [puppet] - 10https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) [09:17:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Cmjohnson [09:18:58] 10Operations, 10DC-Ops, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10MoritzMuehlenhoff) a:05RobH→03Cmjohnson [09:23:28] 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) a:05RobH→03Cmjohnson [09:29:48] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] debdeploy: Fix update_type type [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541558 (owner: 10Muehlenhoff) [09:31:53] (03CR) 10Volans: "Comment inline" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/541394 (owner: 10Ayounsi) [09:39:17] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:39:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:39:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:53] (03PS1) 10Jbond: debdeploy: change global to immutable [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541768 [09:43:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541768 (owner: 10Jbond) [09:44:35] !log draining ganeti1007 for upcoming reboot (combined kernel/qemu security updates) [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:49] apparently my iOS Safari hates me connecting to Wikipedia (gives me Connection Reset error) but my macOS disagrees [09:45:51] * revi shrugs [09:48:37] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [09:48:37] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:15] (03CR) 10Jbond: [V: 03+2 C: 03+2] debdeploy: change global to immutable [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541768 (owner: 10Jbond) [09:52:36] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [09:52:37] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:08] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [09:53:09] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:12] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [09:53:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:53:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:07] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:33] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:00:21] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:00:31] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:00:35] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:00:47] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:00:49] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:02:08] gooood [10:02:13] (03PS1) 10Muehlenhoff: Allow skipping distros again [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541770 [10:02:15] NTT maintenance hopefully over [10:07:24] (03PS1) 10Alexandros Kosiaris: restrouter: Allow the kademlia port in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/541771 (https://phabricator.wikimedia.org/T223953) [10:10:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Allow the kademlia port in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/541771 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) 
[10:10:52] (03Merged) 10jenkins-bot: restrouter: Allow the kademlia port in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/541771 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [10:16:45] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' . [10:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' . [10:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:09] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:22:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:23:09] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:23:15] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:23:18] nope NTT maintenance again :( [10:23:23] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:23:27] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:24:14] the maintenance window closes in ~1.5h [10:27:52] elukey: you decrypt the calendar! 
[10:27:58] decrypted* [10:28:46] effie: this info was only in maint announce and not in the cal :( [10:29:04] oh that is why I couldn't find it [10:33:05] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:33:29] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:34:17] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:27] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:35] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:34:41] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:34:45] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:58:40] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1100). [11:00:04] alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:22] I'm filling in for Alaa [11:01:11] I can do SWAT today [11:04:15] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' . [11:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:541777|Put write both limit down to Q70m for item terms (T234948)]] (duration: 01m 10s) [11:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:47] T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction - https://phabricator.wikimedia.org/T234948 [11:05:05] !log EU SWAT is done [11:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:25:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:18] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:25:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:47] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:25:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:40] !log draining ganeti1008 for upcoming reboot (combined 
kernel/qemu security updates) [11:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:28] I have no idea what https://usercontent.irccloud-cdn.com/file/woqWO186/image.png [12:00:39] what's going on (this has been like this for few hrs) [12:02:39] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:02:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:46] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:02:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:03:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:31] !log failover Ganeti master in eqiad to ganeti1003 [12:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:41] PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={PATCH,POST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:11:59] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={create,get} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:13:11] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=LIST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:13:19] RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:13:35] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:13:52] !log draining ganeti1001 for upcoming reboot (combined kernel/qemu security updates) [12:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:59] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:15:04] all these ^ are the ganeti moves [12:16:27] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:17:15] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:20:41] !log mobrovac@deploy1001 Started deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response - T170455 T234928 [12:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:46] T234928: RESTBase sometimes not retaining stashed content? 
- https://phabricator.wikimedia.org/T234928 [12:20:47] T170455: Extract the feed endpoints from PCS into a new wikifeeds service - https://phabricator.wikimedia.org/T170455 [12:28:23] !log depooling cp1085 for a power drain - T231525 [12:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:27] T231525: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 [12:30:21] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response - T170455 T234928 (duration: 09m 40s) [12:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:26] T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928 [12:30:26] T170455: Extract the feed endpoints from PCS into a new wikifeeds service - https://phabricator.wikimedia.org/T170455 [12:32:12] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [12:32:57] !log mobrovac@deploy1001 Started deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response, take #2 [12:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:00] XioNoX: ^^^ [12:33:11] thx [12:33:33] 2019-10-08 12:16:21 UTC Minor FPC 2 PEM 1 is not powered [12:34:49] pinged eqiad-ops on -dcops [12:35:10] !log reimage puppetmaster2002 [12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:21] !log disabled puppet on DNS recursors for staged rollout of ferm NTP change [12:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:23] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:38:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:38:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:59] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:40:27] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [12:40:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P9277 and previous config saved to /var/cache/conftool/dbconfig/20191009-124035-marostegui.json [12:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:15] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response, take #2 (duration: 08m 18s) [12:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:42:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 for BBU replacement T231638', diff saved to https://phabricator.wikimedia.org/P9278 and 
previous config saved to /var/cache/conftool/dbconfig/20191009-124218-marostegui.json [12:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:23] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 [12:42:41] !log Stop MySQL and power off db1074 for BBU replacement T231638 [12:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:47] PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance={cp3030:9536,cp3032:9536,cp3033:9536,cp3040:9536,cp3041:9536,cp3042:9536,cp3043:9536} site=esams tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:46:17] PROBLEM - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is CRITICAL: instance={cp4027:9536,cp4028:9536,cp4029:9536,cp4030:9536,cp4031:9536,cp4032:9536} site=ulsfo tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:46:45] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:46:45] PROBLEM - Aggregate IPsec Tunnel Status eqsin on icinga1001 is CRITICAL: instance={cp5007:9536,cp5008:9536,cp5009:9536,cp5010:9536,cp5011:9536,cp5012:9536} site=eqsin tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:46:51] arg.. 
that's expected [12:47:15] good :) [12:48:24] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:28] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:29] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:34] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:34] PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:34] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:38] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:44] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:44] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:46] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:46] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:46] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:46] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:48] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:52] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:48:52] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:49:09] sigh.. I've been too slow with the downtime :/ [12:49:09] sorry [12:49:26] vgutierrez: np, at least those should go away soon~ish right? [12:49:30] those tunnels [12:49:46] Is there something wrong with Phabricator? I'm getting errors in the console and things randomly error out. [12:50:02] Niharika: which kind of errors? [12:50:52] vgutierrez: When trying to 'Show older changes', on a ticket, I see `ReferenceError: Can't find variable: add_event_listener`in the console. [12:50:58] And it doesn't load. [12:52:24] volans: well.. we need to replace varnish-be with ats on text to get rid of the IPSec tunnels [12:53:18] vgutierrez: yeah, soon~ish :D [12:53:29] volans: but it looks like ema is better than me at taming ATS [12:53:32] so yeah.. 
[12:56:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9279 and previous config saved to /var/cache/conftool/dbconfig/20191009-125641-marostegui.json [12:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:47] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [12:59:24] !log mobrovac@deploy1001 Started deploy [restbase/deploy@aaadd73]: Parsoid: Retry fetching stashes with undefined as the revid - T234928 [12:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:29] T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928 [13:08:42] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [13:08:52] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:10:06] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:10:18] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:10:44] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most 
read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [13:10:44] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HT [13:10:44] timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:10:50] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggrega [13:10:50] out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:12:14] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:12:14] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:12:20] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:13:50] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@aaadd73]: Parsoid: Retry fetching stashes with undefined as the revid - T234928 (duration: 14m 26s) [13:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:54] T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928 [13:24:10] dunno what's going on but when I connect from home (eqsin) site doesn't load properly and when I try VPNing to Europe or NorthAmerica it works [13:24:17] probably task-worthy I guess [13:25:00] vgutierrez: FYI ^^^ [13:25:48] probably too late, forgot the TZ [13:25:49] for kowiki or somewhere else it is usually missing CSS stuff and hitting F5 loads the CSS but for wikitech it just don't work [13:26:19] revi: what error are you getitng? [13:26:21] *getting [13:26:35] you can also try to follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [13:27:36] for iPhone, it complains that connection was lost [13:27:37] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:27:37] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:27:53] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:27:53] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:05] and doesn't display anything but an OS error message [13:28:19] for my desktop CSS just don't load, texts are fine [13:28:23] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:27] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:27] RECOVERY - 
IPsec on cp5012 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:29] RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:31] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:33] RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:33] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:37] PROBLEM - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100% [13:28:39] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:39] RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:45] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:45] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:46] I don't know what's written there but it should be also posted off-wikimedia so it can be read even when users cannot access wikimedia servers [13:28:47] RECOVERY - Aggregate IPsec Tunnel Status eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:28:59] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:29:17] RECOVERY - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:29:25] RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:30:29] yup [13:30:39] the server is back :) [13:30:49] yeah I can read wikitech [13:31:49] RECOVERY - Host db1075 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:34:44] PROBLEM - MariaDB Slave SQL: s3 #page on db1075 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:35:12] <_joe_> uh what's up? [13:35:23] db1075 rebooted? [13:35:26] PROBLEM - mysqld processes #page on db1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:35:37] PROBLEM - MariaDB read only s3 on db1075 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:35:49] yup, uptime concurs [13:35:55] <_joe_> this happened last week as well I think? [13:35:56] PROBLEM - MariaDB Slave IO: s3 #page on db1075 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:35:58] errrr [13:36:02] I know what that is [13:36:09] what is it? [13:36:10] jclark-ctr: are you touching db1074 or db1075? [13:36:13] seems like a normal reboot [13:36:34] I will depool db1075 for now [13:36:37] PDU operations? 
[13:36:52] * apergos peeks in [13:36:54] we had a scheduled maintenance for db1074 [13:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'depool db1075', diff saved to https://phabricator.wikimedia.org/P9280 and previous config saved to /var/cache/conftool/dbconfig/20191009-133709-marostegui.json [13:37:11] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active [13:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:24] !log repooling cp1085 - T231525 [13:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:27] T231525: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 [13:37:42] no more errors [13:39:13] db1074 was under on-site maintenance and db1075 had a loose cable [13:39:15] so it went down too [13:39:29] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:40:01] RECOVERY - mysqld processes #page on db1075 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:41:05] PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:24] RECOVERY - MariaDB Slave SQL: s3 #page on db1075 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:43:23] RECOVERY - MariaDB read only s3 on db1075 is OK: Version 10.1.38-MariaDB, Uptime 227s, read_only: True, 1824.48 QPS, connection latency: 0.004589s, query latency: 0.001061s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:43:40] RECOVERY - MariaDB Slave IO: s3 #page on db1075 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:48:35] !log reimage puppetmaster2001 [13:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:00] !log rebalancing Ganeti eqiad/row A after rolling reboots of Ganeti nodes [14:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:56] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:57] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1075 after unexpected reboot', diff saved to https://phabricator.wikimedia.org/P9282 and previous config saved to /var/cache/conftool/dbconfig/20191009-140749-marostegui.json [14:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3318 after schema change T233625', diff saved to https://phabricator.wikimedia.org/P9283 and previous config saved to /var/cache/conftool/dbconfig/20191009-141137-marostegui.json [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:42] 
T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [14:13:32] RECOVERY - Check systemd state on puppetmaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:11] !log cr1-eqsin: change IPv6 address for BGP peer AS4761 [14:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'More trafic to db1075 after unexpected reboot', diff saved to https://phabricator.wikimedia.org/P9284 and previous config saved to /var/cache/conftool/dbconfig/20191009-144400-marostegui.json [14:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9285 and previous config saved to /var/cache/conftool/dbconfig/20191009-144607-marostegui.json [14:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:12] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [14:49:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P9286 and previous config saved to /var/cache/conftool/dbconfig/20191009-144928-marostegui.json [14:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1085 vslow and dump group', diff saved to https://phabricator.wikimedia.org/P9287 and previous config saved to /var/cache/conftool/dbconfig/20191009-145102-marostegui.json [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:34] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for 
release 'calico-policy-controller' . [15:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:02] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1101:3318 after schema change', diff saved to https://phabricator.wikimedia.org/P9288 and previous config saved to /var/cache/conftool/dbconfig/20191009-153705-marostegui.json [15:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:38] (03CR) 10Alexandros Kosiaris: "Isn't production mediawiki talking locally (as in via 127.0.0.1) to mcrouter though?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [15:55:03] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul) [15:58:43] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/6; [edit interfaces interface-range disabled] mem... [15:59:54] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Papaul) a:05Papaul→03Jgreen [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:05:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1075 after unexpected reboot', diff saved to https://phabricator.wikimedia.org/P9289 and previous config saved to /var/cache/conftool/dbconfig/20191009-160506-marostegui.json [16:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:19] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Papaul) [16:15:18] (03CR) 10CDanis: "Mostly LGTM, a couple nits and questions -- thanks!" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [16:15:34] (03CR) 10CDanis: [C: 03+1] site: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [16:17:59] (03PS2) 10CDanis: prometheus global: add rules for correct global HTTP avail [puppet] - 10https://gerrit.wikimedia.org/r/540676 (https://phabricator.wikimedia.org/T234567) [16:22:38] (03CR) 10Alexandros Kosiaris: "> Hm, I wonder if we could modify the script to only search for files that have an e.g. 
one day mtime, or that don't already have shasum s" [puppet] - 10https://gerrit.wikimedia.org/r/541775 (owner: 10Alexandros Kosiaris) [16:25:01] (03PS1) 10Elukey: role::aqs: update druid datasource for MediaWiki history [puppet] - 10https://gerrit.wikimedia.org/r/541850 [16:25:19] milimetric: --^ [16:26:16] (03CR) 10Elukey: [C: 03+2] role::aqs: update druid datasource for MediaWiki history [puppet] - 10https://gerrit.wikimedia.org/r/541850 (owner: 10Elukey) [16:30:57] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10RobH) Please note that when I compare librenms output it seems like it sees both towers right now: ps1-b6-eqiad: https://librenms.wikimedia.org/device/device=50/ ps1-a4-eqiad: ht... [16:32:34] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10RobH) Clarification: https://netbox.wikimedia.org/dcim/devices/1394/ is the OLD ps1-b3-eqiad that should have its hostname set to asset tag, and then set to offline state as its... [16:33:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 50.84 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:33:35] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10ayounsi) It was a PDU miss-configuration and a monitoring issue. Was solved in https://phabricator.wikimedia.org/T229328 [16:34:18] traffic drop looks like same thing that has been happening in eqsin, with a spike followed by a return to baseline [16:35:32] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10wiki_willy) 05Open→03Resolved Thanks for confirming @ayounsi Resolving task. 
[16:35:34] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10wiki_willy) [16:36:27] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 81.24 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:40:05] (03PS1) 10Mholloway: wikifeeds: bump image to 2019-10-09-163206-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541852 (https://phabricator.wikimedia.org/T235102) [16:41:05] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10wiki_willy) a:05RobH→03Jclark-ctr @Jclark-ctr - can you wrap up the netbox entries on this one, and then close out the task? Thanks, Willy [16:42:19] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10RobH) a:05RobH→03Jclark-ctr I've just attempted to connect to ps1-a2-eqiad via serial, and failed. To fix this, I'll outline the steps needed below and after coordination wit... [16:44:08] (03CR) 10Mholloway: [V: 03+2 C: 03+2] wikifeeds: bump image to 2019-10-09-163206-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541852 (https://phabricator.wikimedia.org/T235102) (owner: 10Mholloway) [16:46:50] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [16:46:51] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10CDanis) We probably want to let the new recording rule accumulate some data -- a week's worth? -- and then st... 
[16:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:21] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [16:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:35] (03PS1) 10Nray: Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026) [16:49:30] (03PS1) 10Arturo Borrero Gonzalez: toolforge: apt_pinning: add Buster support [puppet] - 10https://gerrit.wikimedia.org/r/541854 (https://phabricator.wikimedia.org/T235059) [16:50:16] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [16:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: apt_pinning: add Buster support [puppet] - 10https://gerrit.wikimedia.org/r/541854 (https://phabricator.wikimedia.org/T235059) (owner: 10Arturo Borrero Gonzalez) [17:04:25] 'Safari cannot open the page because the network connection was lost' on https://en.wikipedia.org/wiki/Special:Watchlist, essentially the one I reported earlier today https://usercontent.irccloud-cdn.com/file/eISS3gJE/IMG_3166.PNG [17:12:43] (03PS5) 10Arturo Borrero Gonzalez: toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) [17:17:28] (03PS6) 10Arturo Borrero Gonzalez: toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) [17:22:08] !log roll restart aqs on aqs100[4-9] to pick up new Druid config changes [17:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:48] (03CR) 10Anomie: [C: 03+1] "Seems ok to me, although I'm not terribly familiar with this part of the config." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [17:55:09] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config, 10Release-Engineering-Team (Development services): Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10CDanis) Was this discussed during the Monday meeting? What was the outcome? [17:56:42] (03CR) 10Andrew Bogott: [C: 03+1] "Looks good! As far as I know there's nothing automated that depends on these, and it would be nice to get some more intelligible response" [dns] - 10https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836) (owner: 10Arturo Borrero Gonzalez) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1800) [18:09:30] (03PS2) 10EBernhardson: yarn: Add sequential scheduler queue for heavy jobs [puppet] - 10https://gerrit.wikimedia.org/r/541654 [18:30:34] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) [18:32:24] 10Operations: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) a:05RobH→03fgiunchedi @fgiunchedi, ms-be105[1-6].eqiad.wmnet are all online and calling into puppet. You can push them into service as you see fit. Please note when you push them in... 
[18:43:17] (03PS1) 10Effie Mouzeli: lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) [18:43:44] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [18:43:44] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:28] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [18:44:29] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:13] (03CR) 10jerkins-bot: [V: 04-1] lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) (owner: 10Effie Mouzeli) [18:45:50] !log Upgrade restbase-dev1004-{a,b} to Cassandra 3.11.4 -- T200803 [18:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:53] T200803: Test/evaluate Cassandra 3.11.4 for production upgrade - https://phabricator.wikimedia.org/T200803 [18:46:23] (03CR) 10Ottomata: [C: 03+2] yarn: Add sequential scheduler queue for heavy jobs [puppet] - 10https://gerrit.wikimedia.org/r/541654 (owner: 10EBernhardson) [18:46:30] (03PS3) 10Ottomata: yarn: Add sequential scheduler queue for heavy jobs [puppet] - 10https://gerrit.wikimedia.org/r/541654 (owner: 10EBernhardson) [18:51:46] !log Upgrade restbase-dev1005-{a,b} to Cassandra 3.11.4 -- T200803 [18:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:49] T200803: Test/evaluate Cassandra 3.11.4 for production upgrade - https://phabricator.wikimedia.org/T200803 [18:58:40] 
(03PS1) 10Ottomata: Fix hadoop sequential queue xml [puppet] - 10https://gerrit.wikimedia.org/r/541895 [18:59:07] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix hadoop sequential queue xml [puppet] - 10https://gerrit.wikimedia.org/r/541895 (owner: 10Ottomata) [19:00:05] marxarelli: Dear deployers, time to do the MediaWiki train - American version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1900). [19:03:53] (03PS1) 10Dduvall: group1 wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541896 [19:03:55] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541896 (owner: 10Dduvall) [19:04:50] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541896 (owner: 10Dduvall) [19:05:09] marxarelli: Argh, not labswiki. [19:05:26] It's still running on HHVM. [19:05:33] oh, shite [19:05:55] Sorry, forgot to check this morning if it had been fixed yet. [19:06:04] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.1 [19:06:11] already ran ^ [19:06:16] Yeah. [19:06:19] i'll prepare for immediate rollback [19:06:25] of labswiki [19:06:27] Just for wikitech. [19:06:28] Yeah. 
[19:07:03] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.1 (duration: 00m 58s) [19:08:26] syncing now [19:08:48] !log Upgrade restbase-dev1006-{a,b} to Cassandra 3.11.4 -- T200803 [19:09:09] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: labswiki rollback to 1.34.0-wmf.25 due to hhvm [19:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:21] T200803: Test/evaluate Cassandra 3.11.4 for production upgrade - https://phabricator.wikimedia.org/T200803 [19:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:27] (03CR) 10Dzahn: [C: 04-1] phabricator: support buster with PHP 7.3 packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [19:10:04] marxarelli, James_F: ouch. sorry we left that booby trap for you. [19:10:27] (03PS2) 10Effie Mouzeli: lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) [19:10:34] * marxarelli shakes fist at bd808 [19:10:40] * James_F grins. [19:10:45] we are close to ready to try wikitech on php7. I found one more thing to fix in puppet this morning [19:10:57] bd808: On behalf of RelEng, sorry for accidentally taking down your wiki. ;-) [19:11:06] * bd808 is trying to prioritize emergencies today [19:11:11] * James_F nods. [19:11:32] is there a task? i should add it as a train blocker even though it's only for labswiki [19:11:54] T223393 [19:11:54] T223393: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 [19:12:04] cool cool [19:12:07] Hmm, there's some code somewhere running on HHVM. 
[19:12:13] syntax error, unexpected T_CONST, expecting T_VARIABLE in /srv/mediawiki/php-1.35.0-wmf.1/includes/title/NamespaceInfo.php on line 59 [19:12:40] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10dduvall) [19:13:01] Oh, that's labweb1002 which is also wikitechwiki? [19:13:03] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10bd808) Spotted while using `eval.php` on labweb1002: we are currently missing the php7.2-ldap package there. [19:13:31] James_F: yes, labweb* is wikitech [19:14:31] But this is not a request for labswiki? AW2x7H7ox3rdj6D8OhQt [19:14:42] (It's not a request for any wiki, somehow.) [19:16:15] James_F: maybe me playing with eval.php earlier? [19:16:16] James_F: i don't see that error following the rollback [19:16:34] bd808: Ah, could be. [19:16:39] marxarelli: Yeah, all looks good now [19:16:51] I noticed the train running when eval.php crashed with "Error: You might be using an older PHP version (PHP 5.6.99-hhvm)." [19:17:03] I was just worried that we were serving anything other than labswiki via HHVM. [19:17:27] * bd808 was debugging a different OpenStackManager bug [19:18:17] You mean there's more than one?! 
;-) [19:20:33] (03PS1) 10Dduvall: Rollback labswiki to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541901 (https://phabricator.wikimedia.org/T223393) [19:20:37] (03CR) 10Dduvall: [C: 03+2] Rollback labswiki to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541901 (https://phabricator.wikimedia.org/T223393) (owner: 10Dduvall) [19:21:31] (03Merged) 10jenkins-bot: Rollback labswiki to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541901 (https://phabricator.wikimedia.org/T223393) (owner: 10Dduvall) [19:23:31] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10bd808) >>! In T223393#5560962, @bd808 wrote: > Spotted while using `eval.php` on labweb1002: we are currently missing the php7.2-lda... [19:25:06] !log 1.35.0-wmf.1 promoted to group1, labswiki rolled back to 1.34.0-wmf.25 and to be kept back, cc: T233849 [19:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:11] T233849: 1.35.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T233849 [19:27:10] (03PS6) 10BryanDavis: wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [19:30:01] (03PS5) 10Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - 10https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) [19:30:13] (03CR) 10jerkins-bot: [V: 04-1] wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [19:33:58] (03CR) 10Dzahn: [C: 03+1] "lgtm as far as i can tell. thanks for taking it. 
https://puppet-compiler.wmflabs.org/compiler1001/18809/labweb1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [19:34:27] !log milimetric@deploy1001 Started deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix [19:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:21] RECOVERY - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [19:40:28] (03CR) 10BryanDavis: wikitech: switch runtime from HHVM to PHP 7.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [19:40:42] (03PS7) 10BryanDavis: wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [19:44:00] !log milimetric@deploy1001 Finished deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix (duration: 09m 33s) [19:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:10] !log milimetric@deploy1001 Started deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix [19:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:44] (03PS8) 10Andrew Bogott: wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [19:52:07] (03CR) 10Andrew Bogott: [C: 03+2] wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [19:52:10] !log milimetric@deploy1001 Finished deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix (duration: 08m 00s) 
[19:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:27] (03PS1) 10Mholloway: wikifeeds: deploy 2019-10-09-175646-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541906 [19:53:29] (03CR) 10BPirkle: [WIP] Config changes for Echo kask migration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:53:40] (03CR) 10Mholloway: [V: 03+2 C: 03+2] wikifeeds: deploy 2019-10-09-175646-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541906 (owner: 10Mholloway) [19:54:02] (sorry for the spam, having trouble with the scap deploy, will have to try another few times as we debug) [19:54:46] !log milimetric@deploy1001 Started deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix [19:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:56] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [19:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:59] !log milimetric@deploy1001 Finished deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix (duration: 00m 12s) [19:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:16] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [19:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:41] 10Operations, 10Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (10herron) Hi @Varnent, the old list address is disabled and messages sent there will held in moderation indefinitely. The communication mail that was sent out about this IMO is clear that the old... 
[19:58:42] (03PS1) 10Papaul: DNS: Remove mgmt DNS for sarin,db2050 and db2055 [dns] - 10https://gerrit.wikimedia.org/r/541907 [19:59:56] (03CR) 10Krinkle: [Beta Cluster] Enable wmgUseCSPReportOnly for all (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) (owner: 10Jforrester) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T2000). [20:00:16] no parsoid deploy today [20:01:00] (03CR) 10Jforrester: [Beta Cluster] Enable wmgUseCSPReportOnly for all (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) (owner: 10Jforrester) [20:01:56] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [20:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:38] bd808: wikitech on PHP72 seems to work. I can browse, edit, log in and out. [20:03:01] James_F: yeah, we are close to calling that {{done}} [20:03:14] then on to the other bug :) [20:03:27] Awesome. Thank you so much. [20:03:37] which if you could login in semi-fixed by the pending patch for OSM [20:05:07] * James_F nods. 
[20:06:54] !log milimetric@deploy1001 Started deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix [20:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:17] !log milimetric@deploy1001 Finished deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix (duration: 02m 23s) [20:09:19] 10Operations, 10ops-eqiad: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10herron) p:05Triage→03Normal [20:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:05] 10Operations, 10ops-codfw, 10DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (10herron) p:05Triage→03Normal [20:10:41] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:48] !log milimetric@deploy1001 Started deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix [20:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:15] 10Operations, 10ops-eqiad, 10DC-Ops: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10herron) [20:16:23] !log milimetric@deploy1001 Finished deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix (duration: 05m 34s) [20:16:25] !log milimetric@deploy1001 Started deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix [20:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:35] !log milimetric@deploy1001 Finished deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix (duration: 00m 10s) [20:16:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:45] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for sarin,db2050 and db2055 [dns] - 10https://gerrit.wikimedia.org/r/541907 (owner: 10Papaul) [20:17:30] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for sarin,db2050 and db2055 [dns] - 10https://gerrit.wikimedia.org/r/541907 (owner: 10Papaul) [20:18:48] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul) [20:19:09] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul) 05Open→03Resolved complete [20:19:12] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [20:19:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul) [20:19:59] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul) 05Open→03Resolved complete [20:20:02] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [20:20:34] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul) [20:20:52] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul) 05Open→03Resolved complete [20:22:20] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@469ed65]: Update mobileapps to b9a225e [20:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:10] 10Operations, 10ops-eqiad, 10DC-Ops: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10wiki_willy) 
a:03Cmjohnson @Cmjohnson - this task is relabel, update in Netbox, and update switchport descriptions to the newly renamed hostnames [20:23:51] !log otto@deploy1001 Started deploy [analytics/refinery@9b322e4]: (no justification provided) [20:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:20] 10Operations, 10ops-codfw, 10DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (10wiki_willy) a:03Papaul Hi @Papaul - this task is relabel, update in Netbox, and update switchport descriptions to the newly renamed hostnames. Thanks, Willy [20:26:36] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRIT [20:26:36] ve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) is CRITICAL: Test retrieve most-read articles for date with no data (with aggregated=true) returned the unexpected status 404 (expecting: 204): /{domain}/v1/media/image/featured/{year}/{m [20:26:36] ieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) is CRITICAL: Test get In the [20:26:36] unsupported language (with 
aggregated=true) returned the unexpected status 404 (expecting: 204) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:27:44] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value [20:27:44] g keys: [mostread, tfa, image] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:27:48] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:27:58] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value [20:27:58] g keys: [tfa, mostread, image] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:28:14] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value [20:28:14] g keys: [mostread, image, tfa] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:28:17] !log filippo@cumin1001 START - 
Cookbook sre.hosts.downtime [20:28:17] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:22] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [20:28:22] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:42] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@469ed65]: Update mobileapps to b9a225e (duration: 06m 22s) [20:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:00] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [tfa, image, mostread] https://wikitech.wikimedia.org/wiki/RESTBase [20:31:38] !log rebooting ms-be1051 to access BIOS [20:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:52] (03PS1) 10Filippo Giunchedi: hieradata: add ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541910 (https://phabricator.wikimedia.org/T232367) [20:33:47] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541910 (https://phabricator.wikimedia.org/T232367) (owner: 10Filippo Giunchedi) [20:33:54] (03PS2) 10Filippo Giunchedi: hieradata: add ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541910 (https://phabricator.wikimedia.org/T232367) [20:34:06] PROBLEM - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100% [20:34:42] (03PS1) 
10Eevans: restbase: Cassandra client access from k8s [puppet] - 10https://gerrit.wikimedia.org/r/541911 (https://phabricator.wikimedia.org/T234374) [20:35:08] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread, tfa, image] https://wikitech.wikimedia.org/wiki/RESTBase [20:37:30] bd808: OK, should we try rolling labswiki over to 1.35.0-wmf.1? [20:38:24] James_F: andrewbogott and I are live hacking there right now. If things go right we will have a backport "soon" and then can catch up with the train late today/tomorrow [20:38:32] Sure, no worries. [20:38:40] (03PS1) 10Jhedden: openstack: update designate wmfsink handler for newton [puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127) [20:39:00] not being able to do this on mwdebug1xxx is annoying :/ [20:41:11] (03PS3) 10Dzahn: parsoid/conftool: add wtp servers as apache appservers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [20:41:52] PROBLEM - Check systemd state on ms-be1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:58] !log ppchelko@deploy1001 Started deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds [20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:46] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1048 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:42:47] (03PS1) 10Filippo Giunchedi: hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367) [20:43:46] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:04] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:40] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds (duration: 02m 42s) [20:44:40] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127) (owner: 10Jhedden) [20:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:28] (03CR) 10Jhedden: [C: 03+2] openstack: update designate wmfsink handler for newton [puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127) (owner: 10Jhedden) [20:46:44] (03PS2) 10Jhedden: openstack: update designate wmfsink handler for newton [puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127) [20:51:11] (03PS4) 10Dzahn: parsoid/conftool: add new service parsoid.httpd to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [20:53:28] !log ppchelko@deploy1001 Started deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds, rb-dev1006 [20:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:21] (03CR) 10Dzahn: [C: 04-2] "service/services.yaml does not exist anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [20:54:24] (03PS2) 10Filippo Giunchedi: hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367) [20:55:12] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds, rb-dev1006 (duration: 01m 44s) [20:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:44] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:57:32] (03PS3) 10Filippo Giunchedi: hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367) [20:58:56] RECOVERY - Host ms-be2051 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [20:59:40] RECOVERY - Check systemd state on ms-be1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:32] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1048 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:02:20] !log otto@deploy1001 deploy aborted: (no justification provided) (duration: 38m 29s) [21:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:27] !log otto@deploy1001 Started deploy [analytics/refinery@9b322e4]: (no justification provided) [21:02:28] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367) (owner: 10Filippo Giunchedi) [21:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:29] !log otto@deploy1001 Finished deploy 
[analytics/refinery@9b322e4]: (no justification provided) (duration: 00m 02s) [21:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:30] (03PS5) 10Dzahn: parsoid/conftool: add new service parsoid.httpd to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [21:05:04] ACKNOWLEDGEMENT - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with une [21:05:04] path = Missing keys: [tfa, mostread, image] ppchelko restrouter in k8s is not used yet by anything and the issue will be resolved by mobrovac in EU work hours. - The acknowledgement expires at: 2019-10-10 21:03:49. https://wikitech.wikimedia.org/wiki/RESTBase [21:05:04] ACKNOWLEDGEMENT - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with une [21:05:04] path = Missing keys: [mostread, image, tfa] ppchelko restrouter in k8s is not used yet by anything and the issue will be resolved by mobrovac in EU work hours. - The acknowledgement expires at: 2019-10-10 21:03:49. 
https://wikitech.wikimedia.org/wiki/RESTBase [21:05:19] (03PS6) 10Dzahn: parsoid/conftool: add new service parsoid.httpd to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [21:06:01] (03PS6) 10Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - 10https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) [21:16:34] (03CR) 10Dzahn: [C: 03+2] phabricator: support buster with PHP 7.3 packages [puppet] - 10https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [21:22:18] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003605 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:26:56] icinga wants to say it's back to 4 instead of 5 failed hosts but that isn't really true. it just fails differently, but on it [21:27:17] !log swift eqiad-prod: add ms-be105[1-6] - T232367 [21:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:21] T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 [21:27:38] mutante: also I think there's a problem in how we calculate that metric, that can't be right that 4 hosts trigger the alert [21:27:56] godog: i noticed yesterday it got triggered by 4 -> 5 [21:27:56] haven't had the time to look into it but it is in my backlog [21:28:03] yeah that's wrong [21:28:05] alright, thanks! 
[21:31:08] (03PS1) 10Papaul: DNS: Remove mgmt DNS for rhenium and lithium [dns] - 10https://gerrit.wikimedia.org/r/541928 [21:32:42] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for rhenium and lithium [dns] - 10https://gerrit.wikimedia.org/r/541928 (owner: 10Papaul) [21:37:14] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Papaul) [21:39:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission lithium - https://phabricator.wikimedia.org/T229557 (10Papaul) [21:40:19] (03PS1) 10Nuria: Bumping up refine to newest version [puppet] - 10https://gerrit.wikimedia.org/r/541929 (https://phabricator.wikimedia.org/T234461) [21:42:22] 10Operations, 10Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (10MarcoAurelio) [21:42:24] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for rhenium and lithium [dns] - 10https://gerrit.wikimedia.org/r/541928 (owner: 10Papaul) [21:42:43] 10Operations, 10Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (10MarcoAurelio) p:05Triage→03High [21:42:47] (03PS1) 10Dzahn: phabricator::httpd: support stretch/buster with/without php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568) [21:44:46] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Papaul) 05Open→03Resolved complete [21:48:12] 10Operations, 10Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (10Dzahn) Broken by https://gerrit.wikimedia.org/r/c/operations/puppet/+/541386 when we renamed the replication target yesterday. root cause: reject HostKey: gerrit-replica.wikimedia.org as shown in replicati... 
[21:53:50] (03PS1) 10Dzahn: Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/541931 [21:54:10] (03PS2) 10Paladox: Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/541931 (owner: 10Dzahn) [21:54:13] (03CR) 10Paladox: [C: 03+1] Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/541931 (owner: 10Dzahn) [21:54:42] (03PS3) 10Dzahn: Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/541931 (https://phabricator.wikimedia.org/T235135) [21:55:13] (03CR) 10Dzahn: [C: 03+2] "quick fix for now, real fix for later" [puppet] - 10https://gerrit.wikimedia.org/r/541931 (https://phabricator.wikimedia.org/T235135) (owner: 10Dzahn) [22:00:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission phab1002/WMF4727 - https://phabricator.wikimedia.org/T221391 (10Papaul) ` [edit interfaces] - ge-3/0/29 { - description phab1002; - enable; - } [22:01:15] jouncebot: now [22:01:15] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [22:01:39] !log restarting gerrit to revert replication config change (T235135) [22:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:43] T235135: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 [22:02:16] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:41] Hi, whats happening with gerrit.wikimedia.org? [22:03:19] Ok, nothing works now :) [22:04:47] "nothing, works now" vs. "nothing works now". 
but it's the former [22:05:54] "commas are important" pictures popping in my mind [22:05:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (10Papaul) [22:06:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (10Papaul) ` papaul@asw2-d-eqiad# show | compare [edit interfaces] - ge-3/0/8 { - description astatine; - enable; - } [22:06:49] hehe, yea [22:07:15] gerrit is replicating again [22:10:47] 10Operations, 10Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (10Dzahn) replication.log shows it is replicating again and working on the backlog queue right now. [22:14:25] 10Operations, 10LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10nnikkhoui) [22:17:39] (03PS2) 10Dzahn: phabricator::httpd: support stretch/buster with/without php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568) [22:37:08] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/18814/" [puppet] - 10https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [22:37:40] (03CR) 10Dzahn: [C: 03+2] phabricator::httpd: support stretch/buster with/without php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [22:41:49] 10Operations, 10Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (10MarcoAurelio) 05Open→03Resolved a:03Dzahn It looks like everything is back to normal now. [22:49:11] (03PS1) 10Jdlrobson: Enable AMC everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) [22:51:08] (03CR) 10Zoranzoki21: [C: 03+1] "Yes, finally!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: 10Jdlrobson) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T2300). [23:00:05] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:17] present :) [23:03:22] RoanKattouw: MaxSem can i pester you for a swat? [23:03:59] You may try ;) [23:04:50] I can do it [23:05:24] Jdlrobson: Which order should I deploy these in? [23:05:30] im in the corner RoanKattouw on the sofas if you need to pester me in person. [23:05:38] 1st up should be the outreach drawer i think [23:05:45] but you can also do together if that makes sense [23:06:46] (03PS1) 10Dzahn: phabricator: install s-nail instead of heirloom-mailx on buster [puppet] - 10https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568) [23:07:07] (03PS2) 10Catrope: Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026) (owner: 10Nray) [23:07:18] (03CR) 10Catrope: [C: 03+2] Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026) (owner: 10Nray) [23:08:13] (03Merged) 10jenkins-bot: Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026) (owner: 10Nray) [23:09:01] (03CR) 10Masumrezarock100: [C: 03+1] "Thanks John for taking care of this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: 10Jdlrobson) [23:09:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:09:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:13] (03CR) 10Dzahn: [C: 03+2] phabricator: install s-nail instead of heirloom-mailx on buster [puppet] - 10https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [23:10:22] (03PS2) 10Dzahn: phabricator: install s-nail instead of heirloom-mailx on buster [puppet] - 10https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568) [23:10:44] (03PS8) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) [23:10:51] (03CR) 10jerkins-bot: [V: 04-1] mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: 10Brennen Bearnes) [23:11:30] Jdlrobson: Outreach drawer is on mwdebug1002, please test [23:11:35] on it [23:12:51] (03PS9) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) [23:13:29] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis CenturyLink Scheduled Maintenance #: 17161404 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[23:13:29] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis CenturyLink Scheduled Maintenance #: 17161404 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:42] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:13:50] 2019-09-23 09:38:12 GMT - This maintenance is scheduled. [23:14:01] cdanis: you beat me to it. ack. on the calendar [23:14:11] :) [23:14:22] about to sign off for the day, train ride almost over [23:14:31] was still going to check if that interface is really CenturyLink [23:14:36] ok, cu [23:14:48] mutante: yeah, the circuit IDs given in the alert vs in the email matched [23:15:22] ack, great [23:15:25] RoanKattouw: i think we're good here [23:15:41] i will need to check something when amc goes live everywhere too [23:16:02] OK, I'll take this one live first then [23:16:36] (03PS2) 10Catrope: Enable AMC everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: 10Jdlrobson) [23:16:50] (03CR) 10Catrope: [C: 03+2] Enable AMC everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: 10Jdlrobson) [23:17:17] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Turn on AMC outreach modal (T234026) (duration: 00m 59s) [23:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:21] T234026: Deploy AMC contextual hooks modal - https://phabricator.wikimedia.org/T234026 [23:17:37] (03Merged) 10jenkins-bot: Enable AMC everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: 10Jdlrobson) 
[23:20:43] Jdlrobson: AMC everywhere now on mwdebug1002, please test [23:20:58] on it.. [23:23:06] RoanKattouw: looks great! sync away [23:23:19] i'll then keep an eye on logstash [23:23:58] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Nuria) Ping on this, seems this request has been stalled on NDA sign-in for a while [23:24:18] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:24:21] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable AMC on all wikis (T233612) (duration: 00m 58s) [23:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:24] T233612: Deploy Advanced mode to all Wikimedia projects - https://phabricator.wikimedia.org/T233612 [23:29:44] 10Operations, 10Analytics, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) Ping @bblack to give us some priorities around this work [23:30:55] 10Operations, 10User-fgiunchedi: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10fgiunchedi) [23:33:03] sweeet. 
Amc is here [23:39:17] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 2 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Dzahn) 05Open→03Resolved switched over by @andrew and @bd808 [23:39:20] 10Operations, 10Patch-For-Review, 10User-Joe: Package and install php 7.2 in place of php 7.0 - https://phabricator.wikimedia.org/T208433 (10Dzahn) [23:39:29] 10Operations, 10Analytics, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) a:03JAllemandou [23:39:53] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) [23:42:53] (03CR) 10Filippo Giunchedi: "Thanks for the review!" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [23:48:29] (03PS11) 10Filippo Giunchedi: swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [23:48:31] (03PS12) 10Filippo Giunchedi: site: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) [23:49:26] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): package wikimedia-lvs-realserver for buster - https://phabricator.wikimedia.org/T235140 (10Dzahn) [23:49:45] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): package wikimedia-lvs-realserver for buster - https://phabricator.wikimedia.org/T235140 (10Dzahn) a:05Dzahn→03None [23:51:05] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: (no justification provided) [23:51:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:37] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) >>! In T190568#5320142, @MoritzMuehlenhoff wrote: >>>! In T190568#5319370, @Dzahn wrote: >> Next we need to... [23:53:12] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) @Muehlenhoff Currently moving to buster is blocked by T235140 [23:55:01] !log twentyafterfour@deploy1001 deploy aborted: (no justification provided) (duration: 03m 57s) [23:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log