[00:12:14] <wikibugs>	 Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH)
[00:28:51] <wikibugs>	 Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH)
[01:01:31] <wikibugs>	 (PS1) Cwhite: Parameterize path so as to better integrate with Prometheus service discovery. Parameterize spec_segment. Maintain backwards compatibility. Improvements preparing and sanitizing the url for sending to CheckService. [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/541683
[01:32:29] <icinga-wm>	 PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28508 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[01:39:02] <wikibugs>	 Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH) @fgiunchedi:  the task description has been updated  * ms-be105[1236] are all yours * i need to finish installation/setup of ms-be105[45]
[01:40:51] <icinga-wm>	 PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[01:44:31] <icinga-wm>	 RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[01:48:51] <icinga-wm>	 RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[01:50:59] <wikibugs>	 (PS1) Mathew.onipe: wdqs: cleanup un-useful nginx config [puppet] - https://gerrit.wikimedia.org/r/541684
[02:05:33] <wikibugs>	 (PS2) Mathew.onipe: wdqs: cleanup un-useful nginx config [puppet] - https://gerrit.wikimedia.org/r/541684
[02:05:39] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:07:09] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:16:41] <icinga-wm>	 PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 24160 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[02:28:25] <wikibugs>	 (PS4) CRusnov: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[02:30:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[02:32:45] <icinga-wm>	 RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
[02:38:55] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 342869896 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:40:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3872 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:57:45] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:57:47] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:59:21] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:59:23] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:45:12] <wikibugs>	 (PS3) Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588)
[03:45:32] <wikibugs>	 (CR) Mathew.onipe: wdqs: add data-reload cookbook (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[03:53:36] <wikibugs>	 (CR) Mathew.onipe: wdqs: add data-reload cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[03:58:52] <wikibugs>	 Operations, ops-codfw, Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Vgutierrez) >>! In T196560#4927737, @Vgutierrez wrote: > So, the NIC issue reported in T203194 seems to be fixed after upgrading the NIC firmware to version 21.40 (https://www.dell.com/support...
[03:59:53] <wikibugs>	 Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (wiki_willy) a:Cmjohnson→Jclark-ctr Hi @Jclark-ctr - can you hit up @Vgutierrez when you get in during the AM sometime this week to depool the host?  You guys have overlap in the mornings, un...
[04:20:21] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:06] <wikibugs>	 (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541687 (https://phabricator.wikimedia.org/T231433)
[04:36:08] <wikibugs>	 (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541688 (https://phabricator.wikimedia.org/T231433)
[04:39:59] <vgutierrez>	 !log switching cp5004 from nginx to ats-tls - T231433
[04:40:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:05] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:40:44] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541687 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:47:21] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:47:40] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp5004 [puppet] - https://gerrit.wikimedia.org/r/541688 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:47:45] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:47:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3312', diff saved to https://phabricator.wikimedia.org/P9272 and previous config saved to /var/cache/conftool/dbconfig/20191009-044752-marostegui.json
[04:47:53] <vgutierrez>	 ^^ expected
[04:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:45] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp5004 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:53:47] <icinga-wm>	 PROBLEM - Check systemd state on cp5004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:01] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5004 is CRITICAL: connect to address 10.132.0.104 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:54:29] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp5004 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:55:21] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp5004 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:55:29] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345579 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:55:53] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345556 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:56:07] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp5004 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:57:01] <icinga-wm>	 RECOVERY - Check systemd state on cp5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:15] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5004 is OK: HTTP OK: HTTP/1.0 200 OK - 19533 bytes in 0.713 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:58:29] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp5004 is CRITICAL: connect to address 10.132.0.104 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:59:33] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:59:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312 for schema change', diff saved to https://phabricator.wikimedia.org/P9273 and previous config saved to /var/cache/conftool/dbconfig/20191009-045941-marostegui.json
[04:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:51] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:01:39] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[05:02:01] <icinga-wm>	 PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[05:02:34] <vgutierrez>	 hmmm gerrit is having some issues
[05:04:06] <bd808>	 vgutierrez: yeah. this got logged in the releng channel -- "[05:01]  <icinga-wm>	PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring"
[05:04:16] <bd808>	 oh, here too :)
[05:04:21] <vgutierrez>	 indeed :)
[05:04:39] <bd808>	 it's a long gc pause if that's the problem
[05:07:15] <vgutierrez>	 sigh.. icinga puppet run is messed up with gerrit being down...
[05:11:07] <marostegui>	 !log Restart gerrit as it is down
[05:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:43] <icinga-wm>	 PROBLEM - Check the last execution of git_pull_charts on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:14:29] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 25575 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[05:14:34] <vgutierrez>	 :D
[05:14:49] <icinga-wm>	 RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.057 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[05:17:01] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 344288 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:17:35] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp5004 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 344252 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:17:38] <vgutierrez>	 nice... puppet is able to run again on icinga :D
[05:19:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 for schema change - lag will be generated on s6 labs', diff saved to https://phabricator.wikimedia.org/P9274 and previous config saved to /var/cache/conftool/dbconfig/20191009-051911-marostegui.json
[05:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:51] <icinga-wm>	 RECOVERY - Check the last execution of git_pull_charts on deploy1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:24:44] <wikibugs>	 Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): Login on wikitech wiki fails after OpenStack upgrade removed v2 identity API - https://phabricator.wikimedia.org/T234996 (bd808) >>! In T234996#5557713, @gerritbot wrote: > Change 541644 had a related patch set up...
[05:25:44] <wikibugs>	 Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): Login on wikitech wiki fails after OpenStack upgrade removed v2 identity API - https://phabricator.wikimedia.org/T234996 (bd808) >>! In T234996#5557995, @bd808 wrote: > I set a custom message at https://wikitech.w...
[05:41:22] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool es1013, es1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/541707 (https://phabricator.wikimedia.org/T227536)
[05:42:57] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool es1013, es1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/541707 (https://phabricator.wikimedia.org/T227536) (owner: Marostegui)
[05:43:40] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool es1013, es1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/541707 (https://phabricator.wikimedia.org/T227536) (owner: Marostegui)
[05:45:10] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1013, es1014 T227536 (duration: 01m 00s)
[05:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:14] <stashbot>	 T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536
[05:45:24] <wikibugs>	 (PS5) CRusnov: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[05:46:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[05:46:57] <wikibugs>	 (PS1) Marostegui: site.pp: Remove puppet references for db2069 [puppet] - https://gerrit.wikimedia.org/r/541709 (https://phabricator.wikimedia.org/T230107)
[05:47:25] <wikibugs>	 (PS1) Marostegui: wmnet: Remove db2069 production DNS entries [dns] - https://gerrit.wikimedia.org/r/541710 (https://phabricator.wikimedia.org/T230107)
[05:47:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:44] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2069.codfw.wmnet` -  db2069.codfw.wmnet (**PASS**)...
[05:47:52] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove puppet references for db2069 [puppet] - https://gerrit.wikimedia.org/r/541709 (https://phabricator.wikimedia.org/T230107) (owner: Marostegui)
[05:48:22] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Remove db2069 production DNS entries [dns] - https://gerrit.wikimedia.org/r/541710 (https://phabricator.wikimedia.org/T230107) (owner: Marostegui)
[05:48:45] <wikibugs>	 (PS3) Marostegui: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300)
[05:48:52] <wikibugs>	 (PS2) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300)
[05:49:45] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (Marostegui) a:RobH→Papaul
[05:49:58] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (Marostegui) Host ready for on-site steps + switch disablement
[06:04:48] <wikibugs>	 Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[06:09:45] <wikibugs>	 (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541711 (https://phabricator.wikimedia.org/T231433)
[06:09:47] <wikibugs>	 (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541712 (https://phabricator.wikimedia.org/T231433)
[06:31:38] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:42] <vgutierrez>	 !log switching from nginx to ats-tls on cp4024 - T231433
[06:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:46] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[06:35:26] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541711 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[06:38:15] <elukey>	 vgutierrez: nice!
[06:38:40] <vgutierrez>	 yup...
[06:38:56] <elukey>	 half way through more or less right?
[06:38:56] <vgutierrez>	 upload is pretty happy with ats-tls
[06:39:02] <vgutierrez>	 yep
[06:39:10] <vgutierrez>	 I need to debug some tiny issues on text though
[06:42:44] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp4024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[06:43:46] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp4024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[06:46:00] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Thanks, merging" [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:46:07] <wikibugs>	 (PS2) Muehlenhoff: cumin: remove yubiauth alias [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:47:05] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp4024 [puppet] - https://gerrit.wikimedia.org/r/541712 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[06:48:28] <wikibugs>	 Operations, ops-eqiad, User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (elukey) >>! In T227025#5556705, @Cmjohnson wrote: > I don't know what you need me to do...the servers were setup correctly.  There seems to be an issue with...
[06:48:34] <wikibugs>	 (PS1) Alexandros Kosiaris: Fix sessionstore lvs monitoring typo [puppet] - https://gerrit.wikimedia.org/r/541714
[06:49:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Fix sessionstore lvs monitoring typo [puppet] - https://gerrit.wikimedia.org/r/541714 (owner: Alexandros Kosiaris)
[06:50:12] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp4024 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345589 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:50:44] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp4024 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345555 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:53:02] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, you could also simply use php-foo packages from php-defaults, which pulls in the correct phpX.Y-foo packages. But the cu" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[06:53:54] <wikibugs>	 (PS3) Muehlenhoff: cumin: remove yubiauth alias [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:57:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] cumin: remove yubiauth alias [puppet] - https://gerrit.wikimedia.org/r/541388 (owner: Dzahn)
[06:58:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "You don't want to add the servers to the "appserver" cluster. You are adding a new "service" to the already existing "parsoid" cluster." [puppet] - https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[07:07:51] <wikibugs>	 (CR) Elukey: "> I think if you just include the druid user in profile::hadoop::master::hadoop_user_groups," [puppet] - https://gerrit.wikimedia.org/r/541554 (owner: Elukey)
[07:09:37] <wikibugs>	 (CR) Jcrespo: "I would suggest to pause this deploy until the gtid filtering issue gets researched." [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[07:10:12] <wikibugs>	 (CR) Marostegui: [C: +1] "> I would suggest to pause this deploy until the gtid filtering issue" [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[07:14:44] <wikibugs>	 (PS4) Elukey: profile::analytics::cluster::users: ensure user druid [puppet] - https://gerrit.wikimedia.org/r/541554
[07:18:03] <wikibugs>	 (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/18804/" [puppet] - https://gerrit.wikimedia.org/r/541554 (owner: Elukey)
[07:19:20] <wikibugs>	 (CR) Alexandros Kosiaris: "> Add .pipeline/config.yaml with publish stage:" [deployment-charts] - https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: Jeena Huneidi)
[07:26:25] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:29:15] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:29:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:53] <wikibugs>	 Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[07:36:55] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:37:36] <wikibugs>	 (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541752 (https://phabricator.wikimedia.org/T231433)
[07:37:38] <wikibugs>	 (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541753 (https://phabricator.wikimedia.org/T231433)
[07:38:55] <vgutierrez>	 !log Switch cp3038 from nginx to ats-tls - T231433
[07:38:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:59] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[07:40:10] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541752 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[07:45:31] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp3038 [puppet] - https://gerrit.wikimedia.org/r/541753 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[07:46:49] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp3038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[07:47:02] <effie>	 ^ expected? :p
[07:47:07] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp3038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[07:48:02] <moritzm>	 !log reduced RAM assignment for boron to 8G
[07:48:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:25] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp3038 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345571 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[07:48:43] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp3038 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345553 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/HTTPS
[07:56:33] <wikibugs>	 Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[07:57:08] <wikibugs>	 (CR) Elukey: "Ok sorry I wasn't caffeinated enough :)" [puppet] - https://gerrit.wikimedia.org/r/541554 (owner: Elukey)
[07:59:59] <wikibugs>	 (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541758 (https://phabricator.wikimedia.org/T231433)
[08:00:01] <wikibugs>	 (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433)
[08:00:44] <wikibugs>	 (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433)
[08:01:52] <vgutierrez>	 !log Switch cp2011 from nginx to ats-tls - T231433
[08:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:57] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[08:04:58] <wikibugs>	 (PS2) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541758 (https://phabricator.wikimedia.org/T231433)
[08:05:00] <wikibugs>	 (PS3) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433)
[08:06:15] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541758 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[08:09:09] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp2011 [puppet] - https://gerrit.wikimedia.org/r/541759 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[08:09:33] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:10:47] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[08:12:17] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp2011 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345563 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 43 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:14:06] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:14:06] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:40] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:18:41] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:50] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:18:51] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:13] <wikibugs>	 Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[08:24:29] <moritzm>	 !log draining ganeti1006 for upcoming reboot (combined kernel/qemu security updates)
[08:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:01] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:33:15] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:33:33] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:34:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:34:37] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:37:18] <elukey>	 mmmmm
[08:37:31] <effie>	 I think there is a maint 
[08:37:36] <effie>	 I am trying to decrypt the calendar
[08:37:57] <elukey>	 there is a Telia notification that I can see
[08:38:03] <elukey>	 checking on the routers
[08:38:31] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:39:23] <wikibugs>	 (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1082 [puppet] - https://gerrit.wikimedia.org/r/541762 (https://phabricator.wikimedia.org/T231433)
[08:39:26] <wikibugs>	 (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp1082 [puppet] - https://gerrit.wikimedia.org/r/541763 (https://phabricator.wikimedia.org/T231433)
[08:39:40] <vgutierrez>	 !log Switch cp1082 from nginx to ats-tls - T231433
[08:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:45] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[08:40:14] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp1082 [puppet] - https://gerrit.wikimedia.org/r/541762 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[08:42:36] <elukey>	 I think that there is a GRE tunnel down due to a transit maintenance, or similar
[08:43:33] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp1082 [puppet] - https://gerrit.wikimedia.org/r/541763 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[08:44:04] <elukey>	 NTT between eqdfw and ulsfo 
[08:44:45] <elukey>	 that I guess we have also between ulsfo and eqsin?
[08:45:25] <elukey>	 yep
[08:46:31] <elukey>	 ahhh ok I can see in maint announce the NTT scheduled maintenance
[08:46:48] <elukey>	 but it is not in the gcal afaics
[08:47:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1085 after schema change', diff saved to https://phabricator.wikimedia.org/P9275 and previous config saved to /var/cache/conftool/dbconfig/20191009-084732-marostegui.json
[08:47:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:00] <elukey>	 nothing on fire in theory
[08:49:11] <elukey>	 does what I wrote above make sense?
[08:50:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3316 for schema change, temporarily pool db1085 as vslow,dump', diff saved to https://phabricator.wikimedia.org/P9276 and previous config saved to /var/cache/conftool/dbconfig/20191009-085016-marostegui.json
[08:50:19] <elukey>	 (the NTT maintenance is affecting ulsfo so both "legs" between eqdfw and eqsin are suffering, causing the OSPF alarms)
[08:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:43] <elukey>	 (legs == GRE tunnels)
[08:51:10] <XioNoX>	 elukey: which links are having maintenances?
[08:51:19] <wikibugs>	 Operations, Traffic, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[08:51:38] <elukey>	 XioNoX: I've read only NTT in ulsfo 
[08:52:37] <elukey>	 they are saying a sw upgrade
[08:53:05] <elukey>	 but if transit is down in ulsfo then both GRE tunnels are down no?
[08:53:57] <elukey>	 elukey@cr4-ulsfo> show interfaces descriptions | match down
[08:53:57] <elukey>	 et-0/0/2        down  down DISABLED
[08:53:58] <elukey>	 xe-0/1/0        up    down Transit: NTT (service ID 234631) {#1079} [10Gbps]
[08:54:12] <elukey>	 XioNoX: --^
[08:55:33] <XioNoX>	 on my phone and signal is terrible in CDG...
[08:59:05] <XioNoX>	 elukey: your diagnosis makes sense to me
[08:59:40] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.decommission
[08:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:52] <elukey>	 XioNoX: I can open a ticket to NTT, they said no impact expected :D
[09:00:31] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:00:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:49] <wikibugs>	 Operations, DC-Ops, decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `auth1001.eqiad.wmnet` -  auth1001.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga   - Downtimed managemen...
[09:01:01] <wikibugs>	 Operations, DC-Ops, decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (MoritzMuehlenhoff)
[09:01:25] <wikibugs>	 Operations, DC-Ops, decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (MoritzMuehlenhoff)
[09:03:06] <XioNoX>	 elukey: not strictly needed as long as we're in the window
[09:04:19] <wikibugs>	 (PS1) Muehlenhoff: auth1001: Remove remaining puppet references [puppet] - https://gerrit.wikimedia.org/r/541764 (https://phabricator.wikimedia.org/T234909)
[09:05:17] <elukey>	 XioNoX: yes we are, it will last 4h
[09:07:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] auth1001: Remove remaining puppet references [puppet] - https://gerrit.wikimedia.org/r/541764 (https://phabricator.wikimedia.org/T234909) (owner: Muehlenhoff)
[09:07:50] <elukey>	 ok will wait then
[09:09:44] <wikibugs>	 Operations, DC-Ops, SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (Volans) Open→Resolved I'm marking this as resolved as the cookbook has been used many times at this point and both Phabricator templated and wikitech documentation have been updated acco...
[09:09:53] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[09:13:20] <wikibugs>	 (CR) Volans: "> Patch Set 1:" [homer/public] - https://gerrit.wikimedia.org/r/541375 (owner: Ayounsi)
[09:13:40] <wikibugs>	 (PS1) Muehlenhoff: Remove DNS entries for auth1001 [dns] - https://gerrit.wikimedia.org/r/541766 (https://phabricator.wikimedia.org/T234909)
[09:15:50] <wikibugs>	 (CR) Jbond: [C: +2] puppet::rsync: disable chroot on volatile and ssl rsync [puppet] - https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[09:15:59] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove DNS entries for auth1001 [dns] - https://gerrit.wikimedia.org/r/541766 (https://phabricator.wikimedia.org/T234909) (owner: Muehlenhoff)
[09:16:16] <wikibugs>	 (PS4) Jbond: puppet::rsync: disable chroot on volatile and ssl rsync [puppet] - https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315)
[09:17:01] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→Cmjohnson
[09:18:58] <wikibugs>	 Operations, DC-Ops, decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (MoritzMuehlenhoff) a:RobH→Cmjohnson
[09:23:28] <wikibugs>	 Operations, decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (elukey) a:RobH→Cmjohnson
[09:29:48] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] debdeploy: Fix update_type type [debs/debdeploy] - https://gerrit.wikimedia.org/r/541558 (owner: Muehlenhoff)
[09:31:53] <wikibugs>	 (CR) Volans: "Comment inline" (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/541394 (owner: Ayounsi)
[09:39:17] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:39:17] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:53] <wikibugs>	 (PS1) Jbond: debdeploy: change global to immutable [debs/debdeploy] - https://gerrit.wikimedia.org/r/541768
[09:43:52] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/541768 (owner: Jbond)
[09:44:35] <moritzm>	 !log draining ganeti1007 for upcoming reboot (combined kernel/qemu security updates)
[09:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:49] <revi>	 apparently my iOS Safari hates me connecting to Wikipedia (gives me Connection Reset error) but my macOS disagrees
[09:45:51] * revi shrugs
[09:48:37] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[09:48:37] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:15] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] debdeploy: change global to immutable [debs/debdeploy] - https://gerrit.wikimedia.org/r/541768 (owner: Jbond)
[09:52:36] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[09:52:37] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:52:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:08] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[09:53:09] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:12] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[09:53:13] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:53:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:07] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:59:33] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:00:21] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:00:31] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:00:35] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:00:47] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:00:49] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:02:08] <elukey>	 gooood
[10:02:13] <wikibugs>	 (PS1) Muehlenhoff: Allow skipping distros again [debs/debdeploy] - https://gerrit.wikimedia.org/r/541770
[10:02:15] <elukey>	 NTT maintenance hopefully over
[10:07:24] <wikibugs>	 (PS1) Alexandros Kosiaris: restrouter: Allow the kademlia port in ingress [deployment-charts] - https://gerrit.wikimedia.org/r/541771 (https://phabricator.wikimedia.org/T223953)
[10:10:39] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] restrouter: Allow the kademlia port in ingress [deployment-charts] - https://gerrit.wikimedia.org/r/541771 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[10:10:52] <wikibugs>	 (Merged) jenkins-bot: restrouter: Allow the kademlia port in ingress [deployment-charts] - https://gerrit.wikimedia.org/r/541771 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[10:16:45] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' .
[10:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:58] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' .
[10:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:09] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:22:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:23:09] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:23:15] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:23:18] <elukey>	 nope NTT maintenance again :(
[10:23:23] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:23:27] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:24:14] <elukey>	 the maintenance window closes in ~1.5h
[10:27:52] <effie>	 elukey: you decrypt the calendar!
[10:27:58] <effie>	 decrypted*
[10:28:46] <elukey>	 effie: this info was only in maint announce and not in the cal :(
[10:29:04] <effie>	 oh that is why I couldn't find it
[10:33:05] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:33:29] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:34:17] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:34:27] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:34:35] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:34:41] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:34:45] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:58:40] <logmsgbot>	 !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[10:58:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1100).
[11:00:04] <jouncebot>	 alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:22] <Amir1>	 I'm filling in for Alaa
[11:01:11] <Amir1>	 I can do SWAT today
[11:04:15] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' .
[11:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:44] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:541777|Put write both limit down to Q70m for item terms (T234948)]] (duration: 01m 10s)
[11:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:47] <stashbot>	 T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction - https://phabricator.wikimedia.org/T234948
[11:05:05] <Amir1>	 !log EU SWAT is done
[11:05:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:12] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:25:13] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:18] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:25:18] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:25:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:47] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:25:48] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:25:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:40] <moritzm>	 !log draining ganeti1008 for upcoming reboot (combined kernel/qemu security updates)
[11:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:28] <revi>	 I have no idea what https://usercontent.irccloud-cdn.com/file/woqWO186/image.png
[12:00:39] <revi>	 what's going on (this has been like this for few hrs)
[12:02:39] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[12:02:40] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:46] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[12:02:47] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:12] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[12:03:13] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:31] <moritzm>	 !log failover Ganeti master in eqiad to ganeti1003
[12:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={PATCH,POST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:11:59] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={create,get} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:13:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:13:19] <icinga-wm>	 RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:13:35] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:13:52] <moritzm>	 !log draining ganeti1001 for upcoming reboot (combined kernel/qemu security updates)
[12:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:59] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:15:04] <akosiaris>	 all these ^ are the ganeti moves
[12:16:27] <icinga-wm>	 RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:17:15] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:20:41] <logmsgbot>	 !log mobrovac@deploy1001 Started deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response - T170455 T234928
[12:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:46] <stashbot>	 T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928
[12:20:47] <stashbot>	 T170455: Extract the feed endpoints from PCS into a new wikifeeds service - https://phabricator.wikimedia.org/T170455
[12:28:23] <vgutierrez>	 !log depooling cp1085 for a power drain - T231525
[12:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:27] <stashbot>	 T231525: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525
[12:30:21] <logmsgbot>	 !log mobrovac@deploy1001 Finished deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response - T170455 T234928 (duration: 09m 40s)
[12:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:26] <stashbot>	 T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928
[12:30:26] <stashbot>	 T170455: Extract the feed endpoints from PCS into a new wikifeeds service - https://phabricator.wikimedia.org/T170455
[12:32:12] <librenms-wmf>	 Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper alarm active
[12:32:57] <logmsgbot>	 !log mobrovac@deploy1001 Started deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response, take #2
[12:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:00] <volans>	 XioNoX: ^^^
[12:33:11] <XioNoX>	 thx
[12:33:33] <XioNoX>	 2019-10-08 12:16:21 UTC  Minor  FPC 2 PEM 1 is not powered
[12:34:49] <XioNoX>	 pinged eqiad-ops on -dcops
[12:35:10] <jbond42>	 !log reimage puppetmaster2002
[12:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:21] <moritzm>	 !log disabled puppet on DNS recursors for staged rollout of ferm NTP change 
[12:36:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:23] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[12:38:27] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[12:38:28] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:59] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[12:40:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob
[12:40:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P9277 and previous config saved to /var/cache/conftool/dbconfig/20191009-124035-marostegui.json
[12:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:15] <logmsgbot>	 !log mobrovac@deploy1001 Finished deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response, take #2 (duration: 08m 18s)
[12:41:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[12:42:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 for BBU replacement T231638', diff saved to https://phabricator.wikimedia.org/P9278 and previous config saved to /var/cache/conftool/dbconfig/20191009-124218-marostegui.json
[12:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:23] <stashbot>	 T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638
[12:42:41] <marostegui>	 !log Stop MySQL and power off db1074 for BBU replacement T231638
[12:42:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:47] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance={cp3030:9536,cp3032:9536,cp3033:9536,cp3040:9536,cp3041:9536,cp3042:9536,cp3043:9536} site=esams tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:46:17] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is CRITICAL: instance={cp4027:9536,cp4028:9536,cp4029:9536,cp4030:9536,cp4031:9536,cp4032:9536} site=ulsfo tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:46:45] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:46:45] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqsin on icinga1001 is CRITICAL: instance={cp5007:9536,cp5008:9536,cp5009:9536,cp5010:9536,cp5011:9536,cp5012:9536} site=eqsin tunnel={cp1085_v4,cp1085_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:46:51] <vgutierrez>	 arg.. that's expected
[12:47:15] <elukey>	 good :)
[12:48:24] <icinga-wm>	 PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:28] <icinga-wm>	 PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:29] <icinga-wm>	 PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:34] <icinga-wm>	 PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:34] <icinga-wm>	 PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:34] <icinga-wm>	 PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:38] <icinga-wm>	 PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:44] <icinga-wm>	 PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:44] <icinga-wm>	 PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:46] <icinga-wm>	 PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:46] <icinga-wm>	 PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:46] <icinga-wm>	 PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:46] <icinga-wm>	 PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:48] <icinga-wm>	 PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:52] <icinga-wm>	 PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:48:52] <icinga-wm>	 PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1085_v4, cp1085_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:49:09] <vgutierrez>	 sigh.. I've been too slow with the downtime :/
[12:49:09] <vgutierrez>	 sorry
[12:49:26] <volans>	 vgutierrez: np, at least those should go away soon~ish right?
[12:49:30] <volans>	 those tunnels
[12:49:46] <Niharika>	 Is there something wrong with Phabricator? I'm getting errors in the console and things randomly error out.
[12:50:02] <vgutierrez>	 Niharika: which kind of errors?
[12:50:52] <Niharika>	 vgutierrez: When trying to 'Show older changes' on a ticket, I see `ReferenceError: Can't find variable: add_event_listener` in the console.
[12:50:58] <Niharika>	 And it doesn't load.
[12:52:24] <vgutierrez>	 volans: well.. we need to replace varnish-be with ats on text to get rid of the IPSec tunnels
[12:53:18] <volans>	 vgutierrez: yeah, soon~ish :D
[12:53:29] <vgutierrez>	 volans: but it looks like ema is better than me at taming ATS
[12:53:32] <vgutierrez>	 so yeah..
[12:56:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9279 and previous config saved to /var/cache/conftool/dbconfig/20191009-125641-marostegui.json
[12:56:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:47] <stashbot>	 T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[12:59:24] <logmsgbot>	 !log mobrovac@deploy1001 Started deploy [restbase/deploy@aaadd73]: Parsoid: Retry fetching stashes with undefined as the revid - T234928
[12:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:29] <stashbot>	 T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928
[13:08:42] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob
[13:08:52] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:10:06] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:10:18] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:10:44] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob
[13:10:44] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HT
[13:10:44] <icinga-wm>	  timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:10:50] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggrega
[13:10:50] <icinga-wm>	 out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:12:14] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:12:14] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:12:20] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:13:50] <logmsgbot>	 !log mobrovac@deploy1001 Finished deploy [restbase/deploy@aaadd73]: Parsoid: Retry fetching stashes with undefined as the revid - T234928 (duration: 14m 26s)
[13:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:54] <stashbot>	 T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928
[13:24:10] <revi>	 dunno what's going on but when I connect from home (eqsin) the site doesn't load properly, and when I try VPNing to Europe or North America it works
[13:24:17] <revi>	 probably task-worthy I guess
[13:25:00] <volans>	 vgutierrez: FYI ^^^
[13:25:48] <volans>	 probably too late, forgot the TZ
[13:25:49] <revi>	 for kowiki or elsewhere it's usually missing CSS, and hitting F5 loads the CSS, but for wikitech it just doesn't work
[13:26:19] <volans>	 revi: what error are you getitng?
[13:26:21] <volans>	 *getting
[13:26:35] <volans>	 you can also try to follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue
[13:27:36] <revi>	 for iPhone, it complains that connection was lost
[13:27:37] <icinga-wm>	 RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:27:37] <icinga-wm>	 RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:27:53] <icinga-wm>	 RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:27:53] <icinga-wm>	 RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:05] <revi>	 and doesn't display anything but an OS error message
[13:28:19] <revi>	 on my desktop the CSS just doesn't load; the text is fine
[13:28:23] <icinga-wm>	 RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:27] <icinga-wm>	 RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:27] <icinga-wm>	 RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:29] <icinga-wm>	 RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:31] <icinga-wm>	 RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:33] <icinga-wm>	 RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:33] <icinga-wm>	 RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:37] <icinga-wm>	 PROBLEM - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100%
[13:28:39] <icinga-wm>	 RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:39] <icinga-wm>	 RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:45] <icinga-wm>	 RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:45] <icinga-wm>	 RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:28:46] <revi>	 I don't know what's written there but it should be also posted off-wikimedia so it can be read even when users cannot access wikimedia servers
[13:28:47] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:28:59] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:29:17] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:29:25] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:30:29] <vgutierrez>	 yup
[13:30:39] <vgutierrez>	 the server is back :)
[13:30:49] <revi>	 yeah I can read wikitech
[13:31:49] <icinga-wm>	 RECOVERY - Host db1075 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:34:44] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 #page on db1075 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:35:12] <_joe_>	 uh what's up?
[13:35:23] <akosiaris>	 db1075 rebooted?
[13:35:26] <icinga-wm>	 PROBLEM - mysqld processes #page on db1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[13:35:37] <icinga-wm>	 PROBLEM - MariaDB read only s3 on db1075 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[13:35:49] <akosiaris>	 yup, uptime concurs
[13:35:55] <_joe_>	 this happened last week as well I think?
[13:35:56] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 #page on db1075 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:35:58] <marostegui>	 errrr
[13:36:02] <marostegui>	 I know what that is
[13:36:09] <jynus>	 what is it?
[13:36:10] <marostegui>	 jclark-ctr: are you touching db1074 or db1075?
[13:36:13] <akosiaris>	 seems like a normal reboot
[13:36:34] <marostegui>	 I will depool db1075 for now
[13:36:37] <arturo>	 PDU operations?
[13:36:52] * apergos peeks in
[13:36:54] <marostegui>	 we had a scheduled maintenance for db1074
[13:37:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'depool db1075', diff saved to https://phabricator.wikimedia.org/P9280 and previous config saved to /var/cache/conftool/dbconfig/20191009-133709-marostegui.json
[13:37:11] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active
[13:37:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:24] <vgutierrez>	 !log repooling cp1085 - T231525
[13:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:27] <stashbot>	 T231525: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525
[13:37:42] <jynus>	 no more errors
[13:39:13] <marostegui>	 db1074 was under on-site maintenance and db1075 had a loose cable
[13:39:15] <marostegui>	 so it went down too
[13:39:29] <icinga-wm>	 RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[13:40:01] <icinga-wm>	 RECOVERY - mysqld processes #page on db1075 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[13:41:05] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:42:24] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 #page on db1075 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:43:23] <icinga-wm>	 RECOVERY - MariaDB read only s3 on db1075 is OK: Version 10.1.38-MariaDB, Uptime 227s, read_only: True, 1824.48 QPS, connection latency: 0.004589s, query latency: 0.001061s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[13:43:40] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 #page on db1075 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:48:35] <jbond42>	 !log reimage puppetmaster2001
[13:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:00] <moritzm>	 !log rebalancing Ganeti eqiad/row A after rolling reboots of Ganeti nodes
[14:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:56] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:57] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1075 after unexpected reboot', diff saved to https://phabricator.wikimedia.org/P9282 and previous config saved to /var/cache/conftool/dbconfig/20191009-140749-marostegui.json
[14:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3318 after schema change T233625', diff saved to https://phabricator.wikimedia.org/P9283 and previous config saved to /var/cache/conftool/dbconfig/20191009-141137-marostegui.json
[14:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:42] <stashbot>	 T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[14:13:32] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:11] <elukey>	 !log cr1-eqsin: change IPv6 address for BGP peer AS4761
[14:38:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More trafic to db1075 after unexpected reboot', diff saved to https://phabricator.wikimedia.org/P9284 and previous config saved to /var/cache/conftool/dbconfig/20191009-144400-marostegui.json
[14:44:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9285 and previous config saved to /var/cache/conftool/dbconfig/20191009-144607-marostegui.json
[14:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:12] <stashbot>	 T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[14:49:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P9286 and previous config saved to /var/cache/conftool/dbconfig/20191009-144928-marostegui.json
[14:49:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1085 vslow and dump group', diff saved to https://phabricator.wikimedia.org/P9287 and previous config saved to /var/cache/conftool/dbconfig/20191009-145102-marostegui.json
[14:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:34] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[15:02:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:02] <logmsgbot>	 !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[15:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1101:3318 after schema change', diff saved to https://phabricator.wikimedia.org/P9288 and previous config saved to /var/cache/conftool/dbconfig/20191009-153705-marostegui.json
[15:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Isn't production mediawiki talking locally (as in via 127.0.0.1) to mcrouter though?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz)
[15:55:03] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul)
[15:58:43] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Papaul) ` papaul@asw-d-codfw# show | compare  [edit interfaces interface-range vlan-private1-d-codfw] -    member ge-6/0/6; [edit interfaces interface-range disabled]      mem...
[15:59:54] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Papaul) a:05Papaul→03Jgreen
[16:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1600).
[16:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:05:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1075 after unexpected reboot', diff saved to https://phabricator.wikimedia.org/P9289 and previous config saved to /var/cache/conftool/dbconfig/20191009-160506-marostegui.json
[16:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:19] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Papaul)
[16:15:18] <wikibugs>	 (03CR) 10CDanis: "Mostly LGTM, a couple nits and questions -- thanks!" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi)
[16:15:34] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] site: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi)
[16:17:59] <wikibugs>	 (03PS2) 10CDanis: prometheus global: add rules for correct global HTTP avail [puppet] - 10https://gerrit.wikimedia.org/r/540676 (https://phabricator.wikimedia.org/T234567)
[16:22:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Hm, I wonder if we could modify the script to only search for files that have an e.g. one day mtime, or that don't already have shasum s" [puppet] - 10https://gerrit.wikimedia.org/r/541775 (owner: 10Alexandros Kosiaris)
[16:25:01] <wikibugs>	 (03PS1) 10Elukey: role::aqs: update druid datasource for MediaWiki history [puppet] - 10https://gerrit.wikimedia.org/r/541850
[16:25:19] <elukey>	 milimetric: --^
[16:26:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::aqs: update druid datasource for MediaWiki history [puppet] - 10https://gerrit.wikimedia.org/r/541850 (owner: 10Elukey)
[16:30:57] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10RobH) Please note that when I compare librenms output it seems like it sees both towers right now:  ps1-b6-eqiad: https://librenms.wikimedia.org/device/device=50/ ps1-a4-eqiad: ht...
[16:32:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10RobH) Clarification:  https://netbox.wikimedia.org/dcim/devices/1394/ is the OLD ps1-b3-eqiad that should have its hostname set to asset tag, and then set to offline state as its...
[16:33:13] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 50.84 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:33:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10ayounsi) It was a PDU miss-configuration and a monitoring issue. Was solved in https://phabricator.wikimedia.org/T229328
[16:34:18] <cdanis>	 traffic drop looks like same thing that has been happening in eqsin, with a spike followed by a return to baseline
[16:35:32] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10wiki_willy) 05Open→03Resolved Thanks for confirming @ayounsi   Resolving task.
[16:35:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10wiki_willy)
[16:36:27] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 81.24 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:40:05] <wikibugs>	 (03PS1) 10Mholloway: wikifeeds: bump image to 2019-10-09-163206-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541852 (https://phabricator.wikimedia.org/T235102)
[16:41:05] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10wiki_willy) a:05RobH→03Jclark-ctr @Jclark-ctr - can you wrap up the netbox entries on this one, and then close out the task?  Thanks, Willy
[16:42:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10RobH) a:05RobH→03Jclark-ctr I've just attempted to connect to ps1-a2-eqiad via serial, and failed.  To fix this, I'll outline the steps needed below and after coordination wit...
[16:44:08] <wikibugs>	 (03CR) 10Mholloway: [V: 03+2 C: 03+2] wikifeeds: bump image to 2019-10-09-163206-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541852 (https://phabricator.wikimedia.org/T235102) (owner: 10Mholloway)
[16:46:50] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
[16:46:51] <wikibugs>	 10Operations, 10Traffic, 10observability, 10Patch-For-Review: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10CDanis) We probably want to let the new recording rule accumulate some data -- a week's worth? -- and then st...
[16:46:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:21] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[16:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:35] <wikibugs>	 (03PS1) 10Nray: Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026)
[16:49:30] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: apt_pinning: add Buster support [puppet] - 10https://gerrit.wikimedia.org/r/541854 (https://phabricator.wikimedia.org/T235059)
[16:50:16] <logmsgbot>	 !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[16:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: apt_pinning: add Buster support [puppet] - 10https://gerrit.wikimedia.org/r/541854 (https://phabricator.wikimedia.org/T235059) (owner: 10Arturo Borrero Gonzalez)
[17:04:25] <revi>	 'Safari cannot open the page because the network connection was lost' on https://en.wikipedia.org/wiki/Special:Watchlist, essentially the one I reported earlier today https://usercontent.irccloud-cdn.com/file/eISS3gJE/IMG_3166.PNG
[17:12:43] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362)
[17:17:28] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362)
[17:22:08] <elukey>	 !log roll restart aqs on aqs100[4-9] to pick up new Druid config changes
[17:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:48] <wikibugs>	 (03CR) 10Anomie: [C: 03+1] "Seems ok to me, although I'm not terribly familiar with this part of the config." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[17:55:09] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config, 10Release-Engineering-Team (Development services): Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10CDanis) Was this discussed during the Monday meeting?  What was the outcome?
[17:56:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Looks good!  As far as I know there's nothing automated that depends on these, and it would be nice to get some more intelligible response" [dns] - 10https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836) (owner: 10Arturo Borrero Gonzalez)
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1800)
[18:09:30] <wikibugs>	 (03PS2) 10EBernhardson: yarn: Add sequential scheduler queue for heavy jobs [puppet] - 10https://gerrit.wikimedia.org/r/541654
[18:30:34] <wikibugs>	 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH)
[18:32:24] <wikibugs>	 10Operations: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) a:05RobH→03fgiunchedi @fgiunchedi,  ms-be105[1-6].eqiad.wmnet are all online and calling into puppet.  You can push them into service as you see fit.  Please note when you push them in...
[18:43:17] <wikibugs>	 (03PS1) 10Effie Mouzeli: lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286)
[18:43:44] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[18:43:44] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:43:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:28] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[18:44:29] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) (owner: 10Effie Mouzeli)
[18:45:50] <urandom>	 !log Upgrade restbase-dev1004-{a,b} to Cassandra 3.11.4 -- T200803
[18:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:53] <stashbot>	 T200803: Test/evaluate Cassandra 3.11.4 for production upgrade - https://phabricator.wikimedia.org/T200803
[18:46:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] yarn: Add sequential scheduler queue for heavy jobs [puppet] - 10https://gerrit.wikimedia.org/r/541654 (owner: 10EBernhardson)
[18:46:30] <wikibugs>	 (03PS3) 10Ottomata: yarn: Add sequential scheduler queue for heavy jobs [puppet] - 10https://gerrit.wikimedia.org/r/541654 (owner: 10EBernhardson)
[18:51:46] <urandom>	 !log Upgrade restbase-dev1005-{a,b} to Cassandra 3.11.4 -- T200803
[18:51:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:49] <stashbot>	 T200803: Test/evaluate Cassandra 3.11.4 for production upgrade - https://phabricator.wikimedia.org/T200803
[18:58:40] <wikibugs>	 (03PS1) 10Ottomata: Fix hadoop sequential queue xml [puppet] - 10https://gerrit.wikimedia.org/r/541895
[18:59:07] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix hadoop sequential queue xml [puppet] - 10https://gerrit.wikimedia.org/r/541895 (owner: 10Ottomata)
[19:00:05] <jouncebot>	 marxarelli: Dear deployers, time to do the MediaWiki train - American version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T1900).
[19:03:53] <wikibugs>	 (03PS1) 10Dduvall: group1 wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541896
[19:03:55] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541896 (owner: 10Dduvall)
[19:04:50] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541896 (owner: 10Dduvall)
[19:05:09] <James_F>	 marxarelli: Argh, not labswiki.
[19:05:26] <James_F>	 It's still running on HHVM.
[19:05:33] <marxarelli>	 oh, shite
[19:05:55] <James_F>	 Sorry, forgot to check this morning if it had been fixed yet.
[19:06:04] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.1
[19:06:11] <marxarelli>	 already ran ^
[19:06:16] <James_F>	 Yeah.
[19:06:19] <marxarelli>	 i'll prepare for immediate rollback
[19:06:25] <marxarelli>	 of labswiki
[19:06:27] <James_F>	 Just for wikitech.
[19:06:28] <James_F>	 Yeah.
[19:07:03] <logmsgbot>	 !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.1 (duration: 00m 58s)
[19:08:26] <marxarelli>	 syncing now
[19:08:48] <urandom>	 !log Upgrade restbase-dev1006-{a,b} to Cassandra 3.11.4 -- T200803
[19:09:09] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: labswiki rollback to 1.34.0-wmf.25 due to hhvm
[19:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:21] <stashbot>	 T200803: Test/evaluate Cassandra 3.11.4 for production upgrade - https://phabricator.wikimedia.org/T200803
[19:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:27] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] phabricator: support buster with PHP 7.3 packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn)
[19:10:04] <bd808>	 marxarelli, James_F: ouch. sorry we left that booby trap for you.
[19:10:27] <wikibugs>	 (03PS2) 10Effie Mouzeli: lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286)
[19:10:34] * marxarelli shakes fist at bd808 
[19:10:40] * James_F grins.
[19:10:45] <bd808>	 we are close to ready to try wikitech on php7. I found one more thing to fix in puppet this morning
[19:10:57] <James_F>	 bd808: On behalf of RelEng, sorry for accidentally taking down your wiki. ;-)
[19:11:06] * bd808 is trying to prioritize emergencies today
[19:11:11] * James_F nods.
[19:11:32] <marxarelli>	 is there a task? i should add it as a train blocker even though it's only for labswiki
[19:11:54] <bd808>	 T223393
[19:11:54] <stashbot>	 T223393: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393
[19:12:04] <marxarelli>	 cool cool
[19:12:07] <James_F>	 Hmm, there's some code somewhere running on HHVM.
[19:12:13] <James_F>	 syntax error, unexpected T_CONST, expecting T_VARIABLE in /srv/mediawiki/php-1.35.0-wmf.1/includes/title/NamespaceInfo.php on line 59
[19:12:40] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10dduvall)
[19:13:01] <James_F>	 Oh, that's labweb1002 which is also wikitechwiki?
[19:13:03] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10bd808) Spotted while using `eval.php` on labweb1002: we are currently missing the php7.2-ldap package there.
[19:13:31] <bd808>	 James_F: yes, labweb* is wikitech
[19:14:31] <James_F>	 But this is not a request for labswiki? AW2x7H7ox3rdj6D8OhQt
[19:14:42] <James_F>	 (It's not a request for any wiki, somehow.)
[19:16:15] <bd808>	 James_F: maybe me playing with eval.php earlier?
[19:16:16] <marxarelli>	 James_F: i don't see that error following the rollback
[19:16:34] <James_F>	 bd808: Ah, could be.
[19:16:39] <James_F>	 marxarelli: Yeah, all looks good now
[19:16:51] <bd808>	 I noticed the train running when eval.php crashed with "Error: You might be using an older PHP version (PHP 5.6.99-hhvm)."
[19:17:03] <James_F>	 I was just worried that we were serving anything other than labswiki via HHVM.
[19:17:27] * bd808 was debugging a different OpenStackManager bug
[19:18:17] <James_F>	 You mean there's more than one?! ;-)
[19:20:33] <wikibugs>	 (03PS1) 10Dduvall: Rollback labswiki to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541901 (https://phabricator.wikimedia.org/T223393)
[19:20:37] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] Rollback labswiki to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541901 (https://phabricator.wikimedia.org/T223393) (owner: 10Dduvall)
[19:21:31] <wikibugs>	 (03Merged) 10jenkins-bot: Rollback labswiki to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541901 (https://phabricator.wikimedia.org/T223393) (owner: 10Dduvall)
[19:23:31] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10bd808) >>! In T223393#5560962, @bd808 wrote: > Spotted while using `eval.php` on labweb1002: we are currently missing the php7.2-lda...
[19:25:06] <marxarelli>	 !log 1.35.0-wmf.1 promoted to group1, labswiki rolled back to 1.34.0-wmf.25 and to be kept back, cc: T233849
[19:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:11] <stashbot>	 T233849: 1.35.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T233849
[19:27:10] <wikibugs>	 (03PS6) 10BryanDavis: wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn)
[19:30:01] <wikibugs>	 (03PS5) 10Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - 10https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568)
[19:30:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn)
[19:33:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm as far as i can tell. thanks for taking it. https://puppet-compiler.wmflabs.org/compiler1001/18809/labweb1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn)
[19:34:27] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix
[19:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:21] <icinga-wm>	 RECOVERY - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[19:40:28] <wikibugs>	 (03CR) 10BryanDavis: wikitech: switch runtime from HHVM to PHP 7.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn)
[19:40:42] <wikibugs>	 (03PS7) 10BryanDavis: wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn)
[19:44:00] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix (duration: 09m 33s)
[19:44:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:10] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix
[19:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:44] <wikibugs>	 (03PS8) 10Andrew Bogott: wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn)
[19:52:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wikitech: switch runtime from HHVM to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn)
[19:52:10] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix (duration: 08m 00s)
[19:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:27] <wikibugs>	 (03PS1) 10Mholloway: wikifeeds: deploy 2019-10-09-175646-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541906
[19:53:29] <wikibugs>	 (03CR) 10BPirkle: [WIP] Config changes for Echo kask migration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope)
[19:53:40] <wikibugs>	 (03CR) 10Mholloway: [V: 03+2 C: 03+2] wikifeeds: deploy 2019-10-09-175646-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/541906 (owner: 10Mholloway)
[19:54:02] <milimetric>	 (sorry for the spam, having trouble with the scap deploy, will have to try another few times as we debug)
[19:54:46] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix
[19:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:56] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
[19:54:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:59] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@0a914bf]: new geoeditors column and wikipedia portal EL fix (duration: 00m 12s)
[19:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:16] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[19:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:41] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (10herron) Hi @Varnent, the old list address is disabled and messages sent there will be held in moderation indefinitely.  The communication mail that was sent out about this IMO is clear that the old...
[19:58:42] <wikibugs>	 (03PS1) 10Papaul: DNS: Remove mgmt DNS for sarin,db2050 and db2055 [dns] - 10https://gerrit.wikimedia.org/r/541907
[19:59:56] <wikibugs>	 (03CR) 10Krinkle: [Beta Cluster] Enable wmgUseCSPReportOnly for all (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) (owner: 10Jforrester)
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T2000).
[20:00:16] <subbu>	 no parsoid deploy today
[20:01:00] <wikibugs>	 (03CR) 10Jforrester: [Beta Cluster] Enable wmgUseCSPReportOnly for all (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) (owner: 10Jforrester)
[20:01:56] <logmsgbot>	 !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[20:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:38] <James_F>	 bd808: wikitech on PHP72 seems to work. I can browse, edit, log in and out.
[20:03:01] <bd808>	 James_F: yeah, we are close to calling that {{done}}
[20:03:14] <bd808>	 then on to the other bug :)
[20:03:27] <James_F>	 Awesome. Thank you so much.
[20:03:37] <bd808>	 which, if you could log in, is semi-fixed by the pending patch for OSM
[20:05:07] * James_F nods.
[20:06:54] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix
[20:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:17] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix (duration: 02m 23s)
[20:09:19] <wikibugs>	 10Operations, 10ops-eqiad: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10herron) p:05Triage→03Normal
[20:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:05] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (10herron) p:05Triage→03Normal
[20:10:41] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:48] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix
[20:10:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:15] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10herron)
[20:16:23] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix (duration: 05m 34s)
[20:16:25] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix
[20:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:35] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@46501d1]: new geoeditors column and wikipedia portal EL fix (duration: 00m 10s)
[20:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for sarin,db2050 and db2055 [dns] - 10https://gerrit.wikimedia.org/r/541907 (owner: 10Papaul)
[20:17:30] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for sarin,db2050 and db2055 [dns] - 10https://gerrit.wikimedia.org/r/541907 (owner: 10Papaul)
[20:18:48] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul)
[20:19:09] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul) 05Open→03Resolved complete
[20:19:12] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul)
[20:19:44] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul)
[20:19:59] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul) 05Open→03Resolved complete
[20:20:02] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul)
[20:20:34] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul)
[20:20:52] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul) 05Open→03Resolved complete
[20:22:20] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@469ed65]: Update mobileapps to b9a225e
[20:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10wiki_willy) a:03Cmjohnson @Cmjohnson - this task is relabel, update in Netbox, and update switchport descriptions to the newly renamed hostnames
[20:23:51] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@9b322e4]: (no justification provided)
[20:23:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:20] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (10wiki_willy) a:03Papaul Hi @Papaul - this task is relabel, update in Netbox, and update switchport descriptions to the newly renamed hostnames.  Thanks, Willy
[20:26:36] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRIT
[20:26:36] <icinga-wm>	 ve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) is CRITICAL: Test retrieve most-read articles for date with no data (with aggregated=true) returned the unexpected status 404 (expecting: 204): /{domain}/v1/media/image/featured/{year}/{m
[20:26:36] <icinga-wm>	 ieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) is CRITICAL: Test get In the
[20:26:36] <icinga-wm>	  unsupported language (with aggregated=true) returned the unexpected status 404 (expecting: 204) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:27:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value
[20:27:44] <icinga-wm>	 g keys: [mostread, tfa, image] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:27:48] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:27:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value
[20:27:58] <icinga-wm>	 g keys: [tfa, mostread, image] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:28:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value
[20:28:14] <icinga-wm>	 g keys: [mostread, image, tfa] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:28:17] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[20:28:17] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:28:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:22] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[20:28:22] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:42] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@469ed65]: Update mobileapps to b9a225e (duration: 06m 22s)
[20:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:00] <icinga-wm>	 PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [tfa, image, mostread] https://wikitech.wikimedia.org/wiki/RESTBase
[20:31:38] <papaul>	 !log rebooting ms-be1051 to access BIOS 
[20:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:52] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: add ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541910 (https://phabricator.wikimedia.org/T232367)
[20:33:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541910 (https://phabricator.wikimedia.org/T232367) (owner: 10Filippo Giunchedi)
[20:33:54] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: add ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541910 (https://phabricator.wikimedia.org/T232367)
[20:34:06] <icinga-wm>	 PROBLEM - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100%
[20:34:42] <wikibugs>	 (03PS1) 10Eevans: restbase: Cassandra client access from k8s [puppet] - 10https://gerrit.wikimedia.org/r/541911 (https://phabricator.wikimedia.org/T234374)
[20:35:08] <icinga-wm>	 PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread, tfa, image] https://wikitech.wikimedia.org/wiki/RESTBase
[20:37:30] <James_F>	 bd808: OK, should we try rolling labswiki over to 1.35.0-wmf.1?
[20:38:24] <bd808>	 James_F: andrewbogott and I are live hacking there right now. If things go right we will have a backport "soon" and then can catch up with the train late today/tomorrow
[20:38:32] <James_F>	 Sure, no worries.
[20:38:40] <wikibugs>	 (03PS1) 10Jhedden: openstack: update designate wmfsink handler for newton [puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127)
[20:39:00] <bd808>	 not being able to do this on mwdebug1xxx is annoying :/
[20:41:11] <wikibugs>	 (03PS3) 10Dzahn: parsoid/conftool: add wtp servers as apache appservers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654)
[20:41:52] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:41:58] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds
[20:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:46] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1048 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:42:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367)
[20:43:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:40] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds (duration: 02m 42s)
[20:44:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127) (owner: 10Jhedden)
[20:44:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:28] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] openstack: update designate wmfsink handler for newton [puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127) (owner: 10Jhedden)
[20:46:44] <wikibugs>	 (03PS2) 10Jhedden: openstack: update designate wmfsink handler for newton [puppet] - 10https://gerrit.wikimedia.org/r/541913 (https://phabricator.wikimedia.org/T235127)
[20:51:11] <wikibugs>	 (03PS4) 10Dzahn: parsoid/conftool: add new service parsoid.httpd to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654)
[20:53:28] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds, rb-dev1006
[20:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:21] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "service/services.yaml does not exist anymore." [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[20:54:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367)
[20:55:12] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@aaadd73] (dev-cluster): Switch to wikifeeds, rb-dev1006 (duration: 01m 44s)
[20:55:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:57:32] <wikibugs>	 (03PS3) 10Filippo Giunchedi: hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367)
[20:58:56] <icinga-wm>	 RECOVERY - Host ms-be2051 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[20:59:40] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:32] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1048 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:02:20] <logmsgbot>	 !log otto@deploy1001 deploy aborted: (no justification provided) (duration: 38m 29s)
[21:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:27] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@9b322e4]: (no justification provided)
[21:02:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: use servers _per_port with ms-be105[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/541916 (https://phabricator.wikimedia.org/T232367) (owner: 10Filippo Giunchedi)
[21:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:29] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/refinery@9b322e4]: (no justification provided) (duration: 00m 02s)
[21:02:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:30] <wikibugs>	 (03PS5) 10Dzahn: parsoid/conftool: add new service parsoid.httpd to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654)
[21:05:04] <icinga-wm>	 ACKNOWLEDGEMENT - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with une
[21:05:04] <icinga-wm>	 path = Missing keys: [tfa, mostread, image] ppchelko restrouter in k8s is not used yet by anything and the issue will be resolved by mobrovac in EU work hours. - The acknowledgement expires at: 2019-10-10 21:03:49. https://wikitech.wikimedia.org/wiki/RESTBase
[21:05:04] <icinga-wm>	 ACKNOWLEDGEMENT - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with une
[21:05:04] <icinga-wm>	 path = Missing keys: [mostread, image, tfa] ppchelko restrouter in k8s is not used yet by anything and the issue will be resolved by mobrovac in EU work hours. - The acknowledgement expires at: 2019-10-10 21:03:49. https://wikitech.wikimedia.org/wiki/RESTBase
[21:05:19] <wikibugs>	 (PS6) Dzahn: parsoid/conftool: add new service parsoid.httpd to wtp servers [puppet] - https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654)
[21:06:01] <wikibugs>	 (PS6) Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568)
[21:16:34] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[21:22:18] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003605 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[21:26:56] <mutante>	 icinga wants to say it's back to 4 instead of 5 failed hosts but that isn't really true. it just fails differently, but on it
[21:27:17] <godog>	 !log swift eqiad-prod: add ms-be105[1-6] - T232367
[21:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:21] <stashbot>	 T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367
[21:27:38] <godog>	 mutante: also I think there's a problem in how we calculate that metric, that can't be right that 4 hosts trigger the alert
[21:27:56] <mutante>	 godog: i noticed yesterday it got triggered by 4 -> 5
[21:27:56] <godog>	 haven't had the time to look into it but it is in my backlog
[21:28:03] <godog>	 yeah that's wrong
[21:28:05] <mutante>	 alright, thanks!
[21:31:08] <wikibugs>	 (PS1) Papaul: DNS: Remove mgmt DNS for rhenium and lithium [dns] - https://gerrit.wikimedia.org/r/541928
[21:32:42] <wikibugs>	 (CR) Dzahn: [C: +1] DNS: Remove mgmt DNS for rhenium and lithium [dns] - https://gerrit.wikimedia.org/r/541928 (owner: Papaul)
[21:37:14] <wikibugs>	 Operations, ops-eqiad, decommission, Patch-For-Review: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (Papaul)
[21:39:51] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission lithium - https://phabricator.wikimedia.org/T229557 (Papaul)
[21:40:19] <wikibugs>	 (PS1) Nuria: Bumping up refine to newest version [puppet] - https://gerrit.wikimedia.org/r/541929 (https://phabricator.wikimedia.org/T234461)
[21:42:22] <wikibugs>	 Operations, Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (MarcoAurelio)
[21:42:24] <wikibugs>	 (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for rhenium and lithium [dns] - https://gerrit.wikimedia.org/r/541928 (owner: Papaul)
[21:42:43] <wikibugs>	 Operations, Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (MarcoAurelio) p:Triage→High
[21:42:47] <wikibugs>	 (PS1) Dzahn: phabricator::httpd: support stretch/buster with/without php-fpm [puppet] - https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568)
[21:44:46] <wikibugs>	 Operations, ops-eqiad, decommission, Patch-For-Review: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (Papaul) Open→Resolved complete
[21:48:12] <wikibugs>	 Operations, Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (Dzahn) Broken by https://gerrit.wikimedia.org/r/c/operations/puppet/+/541386  when we renamed the replication target yesterday.  root cause: reject HostKey: gerrit-replica.wikimedia.org    as shown in replicati...
[21:53:50] <wikibugs>	 (PS1) Dzahn: Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - https://gerrit.wikimedia.org/r/541931
[21:54:10] <wikibugs>	 (PS2) Paladox: Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - https://gerrit.wikimedia.org/r/541931 (owner: Dzahn)
[21:54:13] <wikibugs>	 (CR) Paladox: [C: +1] Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - https://gerrit.wikimedia.org/r/541931 (owner: Dzahn)
[21:54:42] <wikibugs>	 (PS3) Dzahn: Revert "Gerrit: Switch replication url for replica to gerrit-replica" [puppet] - https://gerrit.wikimedia.org/r/541931 (https://phabricator.wikimedia.org/T235135)
[21:55:13] <wikibugs>	 (CR) Dzahn: [C: +2] "quick fix for now, real fix for later" [puppet] - https://gerrit.wikimedia.org/r/541931 (https://phabricator.wikimedia.org/T235135) (owner: Dzahn)
[22:00:00] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission phab1002/WMF4727 - https://phabricator.wikimedia.org/T221391 (Papaul) ` [edit interfaces] -   ge-3/0/29 { -       description phab1002; -       enable; -   }
[22:01:15] <mutante>	 jouncebot: now
[22:01:15] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[22:01:39] <mutante>	 !log restarting gerrit to revert replication config change (T235135)
[22:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:43] <stashbot>	 T235135: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135
[22:02:16] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:41] <Zoranzoki21>	 Hi, whats happening with gerrit.wikimedia.org?
[22:03:19] <Zoranzoki21>	 Ok, nothing works now :)
[22:04:47] <mutante>	 "nothing, works now" vs. "nothing works now". but it's the former
[22:05:54] <godog>	 "commas are important" pictures popping in my mind
[22:05:56] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (Papaul)
[22:06:10] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (Papaul) ` papaul@asw2-d-eqiad# show | compare  [edit interfaces] -   ge-3/0/8 { -       description astatine; -       enable; -   }
[22:06:49] <mutante>	 hehe, yea
[22:07:15] <mutante>	 gerrit is replicating again 
[22:10:47] <wikibugs>	 Operations, Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (Dzahn) replication.log shows it is replicating again and working on the backlog queue right now.
[22:14:25] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (nnikkhoui)
[22:17:39] <wikibugs>	 (PS2) Dzahn: phabricator::httpd: support stretch/buster with/without php-fpm [puppet] - https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568)
[22:37:08] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/18814/" [puppet] - https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[22:37:40] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator::httpd: support stretch/buster with/without php-fpm [puppet] - https://gerrit.wikimedia.org/r/541930 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[22:41:49] <wikibugs>	 Operations, Gerrit: replication/gerrit2001 issues - https://phabricator.wikimedia.org/T235135 (MarcoAurelio) Open→Resolved a:Dzahn It looks everything is back to normal now.
[22:49:11] <wikibugs>	 (PS1) Jdlrobson: Enable AMC everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612)
[22:51:08] <wikibugs>	 (CR) Zoranzoki21: [C: +1] "Yes, finally!" [mediawiki-config] - https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: Jdlrobson)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191009T2300).
[23:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:17] <Jdlrobson>	 present :)
[23:03:22] <Jdlrobson>	 RoanKattouw: MaxSem can i pester you for a swat?
[23:03:59] <MaxSem>	 You may try ;)
[23:04:50] <RoanKattouw>	 I can do it
[23:05:24] <RoanKattouw>	 Jdlrobson: Which order should I deploy these in?
[23:05:30] <Jdlrobson>	 im in the corner RoanKattouw on the sofas if you need to pester me in person.
[23:05:38] <Jdlrobson>	 1st up should be the outreach drawer i think
[23:05:45] <Jdlrobson>	 but you can also do together if that makes sense
[23:06:46] <wikibugs>	 (PS1) Dzahn: phabricator: install s-nail instead of heirloom-mailx on buster [puppet] - https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568)
[23:07:07] <wikibugs>	 (PS2) Catrope: Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026) (owner: Nray)
[23:07:18] <wikibugs>	 (CR) Catrope: [C: +2] Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026) (owner: Nray)
[23:08:13] <wikibugs>	 (Merged) jenkins-bot: Turn on Amc Outreach Modal (contexual hooks campaign) [mediawiki-config] - https://gerrit.wikimedia.org/r/541853 (https://phabricator.wikimedia.org/T234026) (owner: Nray)
[23:09:01] <wikibugs>	 (CR) Masumrezarock100: [C: +1] "Thanks John for taking care of this!" [mediawiki-config] - https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: Jdlrobson)
[23:09:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:09:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:10:13] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: install s-nail instead of heirloom-mailx on buster [puppet] - https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[23:10:22] <wikibugs>	 (PS2) Dzahn: phabricator: install s-nail instead of heirloom-mailx on buster [puppet] - https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568)
[23:10:44] <wikibugs>	 (PS8) Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391)
[23:10:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: Brennen Bearnes)
[23:11:30] <RoanKattouw>	 Jdlrobson: Outreach drawer is on mwdebug1002, please test
[23:11:35] <Jdlrobson>	 on it
[23:12:51] <wikibugs>	 (PS9) Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391)
[23:13:29] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis CenturyLink Scheduled Maintenance #: 17161404 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:13:29] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis CenturyLink Scheduled Maintenance #: 17161404 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:13:42] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:13:50] <mutante>	 2019-09-23 09:38:12 GMT - This maintenance is scheduled.
[23:14:01] <mutante>	 cdanis: you beat me to it. ack. on the calendar
[23:14:11] <cdanis>	 :)
[23:14:22] <cdanis>	 about to sign off for the day, train ride almost over
[23:14:31] <mutante>	 was still going to check if that interface is really CenturyLink
[23:14:36] <mutante>	 ok, cu
[23:14:48] <cdanis>	 mutante: yeah, the circuit IDs given in the alert vs in the email matched
[23:15:22] <mutante>	 ack, great
[23:15:25] <Jdlrobson>	 RoanKattouw: i think we're good here
[23:15:41] <Jdlrobson>	 i will need to check something when amc goes live everywhere too
[23:16:02] <RoanKattouw>	 OK, I'll take this one live first then
[23:16:36] <wikibugs>	 (PS2) Catrope: Enable AMC everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: Jdlrobson)
[23:16:50] <wikibugs>	 (CR) Catrope: [C: +2] Enable AMC everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: Jdlrobson)
[23:17:17] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Turn on AMC outreach modal (T234026) (duration: 00m 59s)
[23:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:21] <stashbot>	 T234026: Deploy AMC contextual hooks modal - https://phabricator.wikimedia.org/T234026
[23:17:37] <wikibugs>	 (Merged) jenkins-bot: Enable AMC everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/541953 (https://phabricator.wikimedia.org/T233612) (owner: Jdlrobson)
[23:20:43] <RoanKattouw>	 Jdlrobson: AMC everywhere now on mwdebug1002, please test
[23:20:58] <Jdlrobson>	 on it..
[23:23:06] <Jdlrobson>	 RoanKattouw: looks great! sync away
[23:23:19] <Jdlrobson>	 i'll then keep an eye on logstash
[23:23:58] <wikibugs>	 Operations, LDAP-Access-Requests, SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (Nuria) Ping on this , seems this request is been stalled on NDA sign in for a while
[23:24:18] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:24:21] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable AMC on all wikis (T233612) (duration: 00m 58s)
[23:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:24] <stashbot>	 T233612: Deploy Advanced mode to all Wikimedia projects - https://phabricator.wikimedia.org/T233612
[23:29:44] <wikibugs>	 Operations, Analytics, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria) Ping @bblack to give us some priorities around this work
[23:30:55] <wikibugs>	 Operations, User-fgiunchedi: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (fgiunchedi)
[23:33:03] <Jdlrobson>	 sweeet. Amc is here
[23:39:17] <wikibugs>	 Operations, Release-Engineering-Team-TODO, serviceops, wikitech.wikimedia.org, and 2 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (Dzahn) Open→Resolved switched over by @andrew and @bd808
[23:39:20] <wikibugs>	 Operations, Patch-For-Review, User-Joe: Package and install php 7.2 in place of php 7.0 - https://phabricator.wikimedia.org/T208433 (Dzahn)
[23:39:29] <wikibugs>	 Operations, Analytics, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria) a:JAllemandou
[23:39:53] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria)
[23:42:53] <wikibugs>	 (CR) Filippo Giunchedi: "Thanks for the review!" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: Filippo Giunchedi)
[23:48:29] <wikibugs>	 (PS11) Filippo Giunchedi: swift: add swiftrepl [puppet] - https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123)
[23:48:31] <wikibugs>	 (PS12) Filippo Giunchedi: site: turn on swiftrepl on swift frontends [puppet] - https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123)
[23:49:26] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): package wikimedia-lvs-realserver for buster - https://phabricator.wikimedia.org/T235140 (Dzahn)
[23:49:45] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): package wikimedia-lvs-realserver for buster - https://phabricator.wikimedia.org/T235140 (Dzahn) a:Dzahn→None
[23:51:05] <logmsgbot>	 !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: (no justification provided)
[23:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:37] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn) >>! In T190568#5320142, @MoritzMuehlenhoff wrote: >>>! In T190568#5319370, @Dzahn wrote: >> Next we need to...
[23:53:12] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn) @Muehlenhoff Currently moving to buster is blocked by T235140
[23:55:01] <logmsgbot>	 !log twentyafterfour@deploy1001 deploy aborted: (no justification provided) (duration: 03m 57s)
[23:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log