[00:00:00] (CR) jerkins-bot: [V: -1] add discovery name for RT, point to moscovium [dns] - https://gerrit.wikimedia.org/r/534129 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[00:01:40] (PS3) Dzahn: add discovery name for RT, point to moscovium [dns] - https://gerrit.wikimedia.org/r/534129 (https://phabricator.wikimedia.org/T180641)
[00:05:16] (CR) Dzahn: [C: +2] add discovery name for RT, point to moscovium [dns] - https://gerrit.wikimedia.org/r/534129 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[00:15:58] (PS1) Dzahn: ATS/varnish: replace director for RT with moscovium [puppet] - https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641)
[00:17:20] (PS1) Dzahn: exim: switch mail for RT to moscovium [puppet] - https://gerrit.wikimedia.org/r/544078 (https://phabricator.wikimedia.org/T180641)
[00:19:12] (PS1) Dzahn: mariadb/ferm_misc: allow moscovium to connect instead of ununpentium [puppet] - https://gerrit.wikimedia.org/r/544079 (https://phabricator.wikimedia.org/T180641)
[00:22:48] (PS1) Dzahn: site/install_server: decom ununpentium [puppet] - https://gerrit.wikimedia.org/r/544080 (https://phabricator.wikimedia.org/T180641)
[00:42:08] (CR) Jforrester: [C: +1] Remove comment about phab bans being superseded by now non existent WP0 bans [puppet] - https://gerrit.wikimedia.org/r/543138 (owner: Reedy)
[01:15:26] (CR) BPirkle: [WIP] Config changes for Echo kask migration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: Catrope)
[01:38:23] (Abandoned) Thcipriani: beta: keyholder: don't require encrypted keys [puppet] - https://gerrit.wikimedia.org/r/544064 (https://phabricator.wikimedia.org/T235674) (owner: Thcipriani)
[02:54:06] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20518208 and 0
seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:01:56] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 156431544 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:03:32] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 54472 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:27:44] (PS1) Vgutierrez: hiera: Move nginx to port 4443 on cp5005 [puppet] - https://gerrit.wikimedia.org/r/544092 (https://phabricator.wikimedia.org/T231433)
[04:27:46] (PS1) Vgutierrez: hiera: Move ats-tls to port 443 on cp5005 [puppet] - https://gerrit.wikimedia.org/r/544093 (https://phabricator.wikimedia.org/T231433)
[04:27:54] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[04:28:12] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:28:24] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:28:38] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[04:28:48] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[04:28:52] uh... stat1007 went down?
[04:29:00] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:06] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[04:29:08] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[04:29:31] hmm nope...
[04:29:38] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:31:12] !log restarting nagios-nrpe-server on stat1007
[04:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:32] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:31:46] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[04:31:54] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[04:32:08] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:12] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[04:32:14] RECOVERY - dhclient process on stat1007 is
OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[04:32:36] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[04:32:54] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:33:45] memory issues on stat1007 till OOM cleaned the mess
[04:34:05] *oom_reaper
[04:34:48] !log switch cp5005 from nginx to ats-tls - T231433
[04:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:52] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:36:03] (CR) Vgutierrez: [C: +2] hiera: Move nginx to port 4443 on cp5005 [puppet] - https://gerrit.wikimedia.org/r/544092 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:40:12] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:42:22] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls to port 443 on cp5005 [puppet] - https://gerrit.wikimedia.org/r/544093 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:45:00] PROBLEM - HTTPS Unified RSA on cp5005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:45:05] ^^ expected
[04:45:30] PROBLEM - HTTPS Unified ECDSA on cp5005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:47:06] RECOVERY - HTTPS Unified ECDSA on cp5005 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345578 seconds left:Certificate *.wikipedia.org contains all
required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:48:10] RECOVERY - HTTPS Unified RSA on cp5005 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345514 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:57:25] !log switch cp4025 from nginx to ats-tls - T231433
[04:57:29] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp4025 [puppet] - https://gerrit.wikimedia.org/r/544094 (https://phabricator.wikimedia.org/T231433)
[04:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:30] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:57:31] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp4025 [puppet] - https://gerrit.wikimedia.org/r/544095 (https://phabricator.wikimedia.org/T231433)
[04:58:26] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp4025 [puppet] - https://gerrit.wikimedia.org/r/544094 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:02:36] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp4025 [puppet] - https://gerrit.wikimedia.org/r/544095 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:05:03] PROBLEM - HTTPS Unified RSA on cp4025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:05:11] PROBLEM - HTTPS Unified ECDSA on cp4025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:05:53] ^^ expected
[05:06:25] RECOVERY - HTTPS Unified RSA on cp4025 is OK: SSL OK - OCSP staple validity for en.wikipedia.org
has 345541 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:06:33] RECOVERY - HTTPS Unified ECDSA on cp4025 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345531 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3311 and db2086:3318 after table compression', diff saved to https://phabricator.wikimedia.org/P9385 and previous config saved to /var/cache/conftool/dbconfig/20191018-050831-marostegui.json
[05:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129 for schema change', diff saved to https://phabricator.wikimedia.org/P9386 and previous config saved to /var/cache/conftool/dbconfig/20191018-051355-marostegui.json
[05:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:52] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp3039 [puppet] - https://gerrit.wikimedia.org/r/544096 (https://phabricator.wikimedia.org/T231433)
[05:14:54] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp3039 [puppet] - https://gerrit.wikimedia.org/r/544097 (https://phabricator.wikimedia.org/T231433)
[05:14:55] !log switch cp3039 from nginx to ats-tls - T231433
[05:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:59] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[05:15:16] !log Compress tables on db2091:3314 T235599
[05:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:21] T235599:
Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599
[05:15:54] !log Deploy schema change on db1129 T233135 T234066
[05:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:59] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[05:15:59] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[05:16:20] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp3039 [puppet] - https://gerrit.wikimedia.org/r/544096 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:19:10] !log Rename m5 labtestwiki database - T233236
[05:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:14] T233236: Move labtestwikitech database to clouddb2001-dev - https://phabricator.wikimedia.org/T233236
[05:19:29] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp3039 [puppet] - https://gerrit.wikimedia.org/r/544097 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:22:31] PROBLEM - HTTPS Unified ECDSA on cp3039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:23:17] (CR) Santhosh: [C: +1] Enable CX out of beta in Malayalam/Bengali/Mongolian WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/543764 (https://phabricator.wikimedia.org/T233008) (owner: KartikMistry)
[05:23:28] (PS1) Marostegui: labtestwikitech.pp: Specify the new location for labtestwiki [puppet] - https://gerrit.wikimedia.org/r/544098 (https://phabricator.wikimedia.org/T233236)
[05:23:40] ^^ expected
[05:23:55] RECOVERY - HTTPS Unified ECDSA on cp3039 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 565954 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until
2020-10-06 12:00:00 +0000 (expires in 354 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:24:45] (PS2) Marostegui: labtestwikitech.pp: Specify the new location for labtestwiki [puppet] - https://gerrit.wikimedia.org/r/544098 (https://phabricator.wikimedia.org/T233236)
[05:31:35] (CR) Marostegui: [C: +2] "Not sure if this class can be entirely cleaned up, but for now specifying where the new database is" [puppet] - https://gerrit.wikimedia.org/r/544098 (https://phabricator.wikimedia.org/T233236) (owner: Marostegui)
[05:31:53] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2014 [puppet] - https://gerrit.wikimedia.org/r/544099 (https://phabricator.wikimedia.org/T231433)
[05:31:55] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2014 [puppet] - https://gerrit.wikimedia.org/r/544100 (https://phabricator.wikimedia.org/T231433)
[05:32:16] !log switch cp2014 from nginx to ats-tls - T231433
[05:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:19] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[05:33:25] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp2014 [puppet] - https://gerrit.wikimedia.org/r/544099 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:33:50] (PS1) Marostegui: site.pp: Remove references to db2059 [puppet] - https://gerrit.wikimedia.org/r/544101 (https://phabricator.wikimedia.org/T230884)
[05:34:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:34:13] (PS1) Marostegui: wmnet: Remove production DNS entries for db2059 [dns] - https://gerrit.wikimedia.org/r/544102 (https://phabricator.wikimedia.org/T230884)
[05:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:34:24] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:05] (CR) Marostegui: [C: +2] site.pp: Remove references to db2059 [puppet] - https://gerrit.wikimedia.org/r/544101 (https://phabricator.wikimedia.org/T230884) (owner: Marostegui)
[05:36:02] (CR) Marostegui: [C: +2] wmnet: Remove production DNS entries for db2059 [dns] - https://gerrit.wikimedia.org/r/544102 (https://phabricator.wikimedia.org/T230884) (owner: Marostegui)
[05:36:31] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 443 on cp2014 [puppet] - https://gerrit.wikimedia.org/r/544100 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:36:49] (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2014 [puppet] - https://gerrit.wikimedia.org/r/544100 (https://phabricator.wikimedia.org/T231433)
[05:38:43] PROBLEM - HTTPS Unified ECDSA on cp2014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:38:48] ^^ expected
[05:39:09] PROBLEM - HTTPS Unified RSA on cp2014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:40:31] RECOVERY - HTTPS Unified RSA on cp2014 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345589 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:41:27] RECOVERY - HTTPS Unified ECDSA on cp2014 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345532 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:53:27] (PS1) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1084 [puppet] - https://gerrit.wikimedia.org/r/544103
(https://phabricator.wikimedia.org/T231433)
[05:53:29] (PS1) Vgutierrez: hiera: Move ats-tls from port 84443 to 443 on cp1084 [puppet] - https://gerrit.wikimedia.org/r/544104 (https://phabricator.wikimedia.org/T231433)
[05:53:30] !log switch cp10184 from nginx to ats-tls - T231433
[05:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:34] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[05:54:08] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp1084 [puppet] - https://gerrit.wikimedia.org/r/544103 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[05:57:46] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 84443 to 443 on cp1084 [puppet] - https://gerrit.wikimedia.org/r/544104 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[06:00:47] PROBLEM - HTTPS Unified ECDSA on cp1084 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[06:01:11] ^^ expected :)
[06:01:32] (CR) DannyS712: "Recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[06:02:15] RECOVERY - HTTPS Unified ECDSA on cp1084 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345516 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:02:49] (PS1) DannyS712: [DNM] Debug testing [mediawiki-config] - https://gerrit.wikimedia.org/r/544105
[06:03:22] (CR) jerkins-bot: [V: -1] [DNM] Debug testing [mediawiki-config] - https://gerrit.wikimedia.org/r/544105 (owner: DannyS712)
[06:03:59] (PS2) DannyS712: [DNM] Debug testing [mediawiki-config] - https://gerrit.wikimedia.org/r/544105
[06:04:33] (CR)
jerkins-bot: [V: -1] [DNM] Debug testing [mediawiki-config] - https://gerrit.wikimedia.org/r/544105 (owner: DannyS712)
[06:04:40] (PS3) DannyS712: [DNM] Debug testing [mediawiki-config] - https://gerrit.wikimedia.org/r/544105
[06:05:13] (CR) jerkins-bot: [V: -1] [DNM] Debug testing [mediawiki-config] - https://gerrit.wikimedia.org/r/544105 (owner: DannyS712)
[06:05:53] (Abandoned) DannyS712: [DNM] Debug testing [mediawiki-config] - https://gerrit.wikimedia.org/r/544105 (owner: DannyS712)
[06:07:11] (PS1) DannyS712: [DNM] Debug for T235816 [mediawiki-config] - https://gerrit.wikimedia.org/r/544106
[06:07:48] (CR) jerkins-bot: [V: -1] [DNM] Debug for T235816 [mediawiki-config] - https://gerrit.wikimedia.org/r/544106 (owner: DannyS712)
[06:17:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:14] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:12] there seems to be Telia maintenance scheduled (but not in the gcal) --^
[06:45:22] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:41] (PS1) Abijeet Patro: Add Translate channel for the Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/544114 (https://phabricator.wikimedia.org/T221119)
[07:08:16] (CR) jerkins-bot: [V: -1] Add Translate channel for the Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/544114 (https://phabricator.wikimedia.org/T221119) (owner: Abijeet Patro)
[07:08:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:00] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:11] (CR) Abijeet Patro: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/544114 (https://phabricator.wikimedia.org/T221119) (owner: Abijeet Patro)
[07:20:11] !log installing libdatetime-timezone-perl updates (time zone updates)#
[07:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:36] !log installing unbound security updates on buster
[07:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1129 after schema change', diff saved to https://phabricator.wikimedia.org/P9387 and previous config saved to /var/cache/conftool/dbconfig/20191018-075529-marostegui.json
[07:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:10] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076 for schema change', diff saved to https://phabricator.wikimedia.org/P9388 and previous config saved to /var/cache/conftool/dbconfig/20191018-075709-marostegui.json
[07:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:00] !log Deploy schema change on db1076
[07:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:59] !log depool cp4028 and reimage as text_ats T227432
[08:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:03] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[08:04:11] (PS2) Ema: cache: reimage cp4028 as text_ats [puppet] - https://gerrit.wikimedia.org/r/543867 (https://phabricator.wikimedia.org/T227432)
[08:05:23] (CR) Ema: [C: +2] cache: reimage cp4028 as text_ats [puppet] - https://gerrit.wikimedia.org/r/543867 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[08:32:05] !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[08:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:07] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:49] (CR) Gehel: [C: -1] wdqs: add data-reload cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[08:39:56] (PS1) Muehlenhoff: Reinstate explicit dependencies for client and server packages [debs/debdeploy] - https://gerrit.wikimedia.org/r/544146
[08:41:03] (CR) Muehlenhoff: [C: +2] Reinstate explicit dependencies for client and server packages [debs/debdeploy] - https://gerrit.wikimedia.org/r/544146 (owner: Muehlenhoff)
[08:42:31] (PS1) Giuseppe Lavagetto: echostore: add discovery configuration
[puppet] - https://gerrit.wikimedia.org/r/544147
[08:42:51] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: Filippo Giunchedi)
[08:42:54] (PS1) Muehlenhoff: Bump changelog for new build [debs/debdeploy] - https://gerrit.wikimedia.org/r/544148
[08:44:43] (PS7) Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655)
[08:48:26] (CR) Muehlenhoff: [C: +2] Bump changelog for new build [debs/debdeploy] - https://gerrit.wikimedia.org/r/544148 (owner: Muehlenhoff)
[08:50:54] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: refresh puppet code for the new k8s [puppet] - https://gerrit.wikimedia.org/r/543815 (https://phabricator.wikimedia.org/T215531) (owner: Arturo Borrero Gonzalez)
[08:55:26] (CR) Filippo Giunchedi: [C: +1] "LGTM, see inline (non blocking)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) (owner: Bstorm)
[08:55:43] (PS2) Giuseppe Lavagetto: echostore: add LVS configuration stanzas [puppet] - https://gerrit.wikimedia.org/r/543123 (https://phabricator.wikimedia.org/T234464)
[08:57:07] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: Jbond)
[08:57:16] (CR) Alexandros Kosiaris: [C: +1] echostore: add discovery configuration [puppet] - https://gerrit.wikimedia.org/r/544147 (owner: Giuseppe Lavagetto)
[08:57:20] (CR) Filippo Giunchedi: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[08:57:41] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01016 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:59:26] (CR) Giuseppe Lavagetto: [C: +2] echostore: add LVS configuration stanzas [puppet] - https://gerrit.wikimedia.org/r/543123 (https://phabricator.wikimedia.org/T234464) (owner: Giuseppe Lavagetto)
[09:01:09] (PS1) Vgutierrez: ATS: Use a common base path for /etc/ssl and /etc/acmecerts certs [puppet] - https://gerrit.wikimedia.org/r/544151 (https://phabricator.wikimedia.org/T234803)
[09:03:30] (CR) Filippo Giunchedi: "LGTM overall! See inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[09:07:29] (PS1) Alexandros Kosiaris: Remove the portforward right from deploy role [deployment-charts] - https://gerrit.wikimedia.org/r/544153 (https://phabricator.wikimedia.org/T235821)
[09:09:15] (PS2) Vgutierrez: ATS: Use a common base path for /etc/ssl and /etc/acmecerts certs [puppet] - https://gerrit.wikimedia.org/r/544151 (https://phabricator.wikimedia.org/T234803)
[09:10:14] (CR) Alexandros Kosiaris: [C: +2] Remove the portforward right from deploy role [deployment-charts] - https://gerrit.wikimedia.org/r/544153 (https://phabricator.wikimedia.org/T235821) (owner: Alexandros Kosiaris)
[09:10:27] (Merged) jenkins-bot: Remove the portforward right from deploy role [deployment-charts] - https://gerrit.wikimedia.org/r/544153 (https://phabricator.wikimedia.org/T235821) (owner: Alexandros Kosiaris)
[09:11:02] <_joe_> !log hotpatching puppet-merge on puppetmaster1001
[09:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:55] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[09:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:59] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[09:13:00] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[09:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:22] !log importing debdeploy 0.0.99.12 to apt.wikimedia.org
[09:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:43] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[09:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:48] (PS1) Filippo Giunchedi: hieradata: fix cluster inconsistencies [puppet] - https://gerrit.wikimedia.org/r/544155 (https://phabricator.wikimedia.org/T234232)
[09:14:49] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[09:14:50] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[09:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:54] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: service=echostore
[09:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:50] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[09:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:54] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[09:16:54] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[09:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:17] (PS1) Elukey: Skip installing a default archiva.xml file [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/544156
[09:19:13] (PS2) Elukey: Skip installing the default archiva.xml file [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/544156
[09:20:05] jbond42 vgutierrez https://gerrit.wikimedia.org/r/c/operations/puppet/+/544155 for your eyes
[09:20:11] (PS1) Alexandros Kosiaris: Revert "Remove the portforward right from deploy role" [deployment-charts] - https://gerrit.wikimedia.org/r/544158 (https://phabricator.wikimedia.org/T235821)
[09:20:34] (PS3) Elukey: Skip installing the default archiva.xml file [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/544156
[09:20:34] <_joe_> !log restarting pybal on lvs2006 to pick up the addition of echostore
[09:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:02] (CR) Muehlenhoff: [C: +1] "LGTM.
Instead of removing the archiva.xml you could also add it to /usr/share/doc/archiva/examples for people who are using our deb, but n" [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/544156 (owner: 10Elukey) [09:22:09] (03PS1) 10Jbond: Revert "Add local flock to puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/544159 [09:22:24] (03PS2) 10Jbond: Revert "Add local flock to puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/544159 [09:23:14] (03CR) 10Jbond: [C: 03+1] "LGTM, however im not entirely familiar with this structure" [puppet] - 10https://gerrit.wikimedia.org/r/544155 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [09:23:53] (03PS3) 10Alexandros Kosiaris: profile::backup::host: Add the ability to configure ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/543877 (https://phabricator.wikimedia.org/T229209) [09:23:58] (03CR) 10Alexandros Kosiaris: "I am not either, but it's temporary enough to not matter much" [puppet] - 10https://gerrit.wikimedia.org/r/543877 (https://phabricator.wikimedia.org/T229209) (owner: 10Alexandros Kosiaris) [09:24:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::backup::host: Add the ability to configure ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/543877 (https://phabricator.wikimedia.org/T229209) (owner: 10Alexandros Kosiaris) [09:24:14] (03CR) 10Elukey: "> LGTM. Instead of removing the archiva.xml you could also add it to" [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/544156 (owner: 10Elukey) [09:24:45] godog: we also need eqsin on cache_text [09:25:10] (03CR) 10Vgutierrez: hieradata: fix cluster inconsistencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544155 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [09:25:25] vgutierrez: doh of course! 
and lvs too [09:26:00] (03CR) 10Jbond: [C: 03+2] Revert "Add local flock to puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/544159 (owner: 10Jbond) [09:28:45] (03CR) 10Filippo Giunchedi: hieradata: fix cluster inconsistencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544155 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [09:28:55] (03PS2) 10Filippo Giunchedi: hieradata: fix cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/544155 (https://phabricator.wikimedia.org/T234232) [09:31:06] (03CR) 10Ayounsi: [C: 03+1] devices: fix behaviour according to docstring [software/homer] - 10https://gerrit.wikimedia.org/r/543886 (owner: 10Volans) [09:31:49] (03PS8) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [09:32:05] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [09:32:28] (03CR) 10Ayounsi: [C: 03+1] typing: use Mapping instead of Dict for arguments [software/homer] - 10https://gerrit.wikimedia.org/r/543887 (owner: 10Volans) [09:34:13] <_joe_> !log restarting pybal on lvs1016 to pick up the addition of echostore [09:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:13] (03CR) 10Ayounsi: [C: 03+1] tests: increase coverage for transports.junos [software/homer] - 10https://gerrit.wikimedia.org/r/543888 (owner: 10Volans) [09:35:44] godog: looking better now :) [09:35:55] (03CR) 10Vgutierrez: [C: 03+1] hieradata: fix cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/544155 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [09:36:06] vgutierrez: indeed! thanks for taking a look [09:36:11] merging [09:36:22] yup... 
that should help clearing icinga.wm.o/alerts a little bit [09:36:38] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix cluster inconsistencies [puppet] - 10https://gerrit.wikimedia.org/r/544155 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [09:36:54] <_joe_> !log restarting pybal on lvs2003 to pick up the addition of echostore [09:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:50] !log pool cp4028 with ATS backend T227432 [09:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:54] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:40:27] (03CR) 10Ayounsi: [C: 03+1] setup.py: remove unused test dependency [software/homer] - 10https://gerrit.wikimedia.org/r/543965 (owner: 10Volans) [09:40:51] <_joe_> !log restarting pybal on lvs1015 to pick up the addition of echostore [09:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:59] (03CR) 10Alexandros Kosiaris: [C: 04-1] scaffold: Add option for TLS termination (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (owner: 10Giuseppe Lavagetto) [09:45:48] (03PS1) 10Giuseppe Lavagetto: lvs::monitor_services: fix the echostore stanza [puppet] - 10https://gerrit.wikimedia.org/r/544165 [09:49:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs::monitor_services: fix the echostore stanza [puppet] - 10https://gerrit.wikimedia.org/r/544165 (owner: 10Giuseppe Lavagetto) [09:49:42] (03CR) 10Ema: [C: 03+1] ATS: Use a common base path for /etc/ssl and /etc/acmecerts certs [puppet] - 10https://gerrit.wikimedia.org/r/544151 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [09:54:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] echostore: add discovery configuration [puppet] - 10https://gerrit.wikimedia.org/r/544147 (owner: 10Giuseppe Lavagetto) [09:57:41] !log oblivian@puppetmaster1001 conftool action : 
set/pooled=true; selector: dnsdisc=echostore [09:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:46] !log rolling out debdeploy 0.0.99.12 fleet-wide [09:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:42] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005072 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:59:58] (03CR) 10Marostegui: "> > please do test on mwdebug in eqiad and codfw with some requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [10:00:28] (03CR) 10Ayounsi: [C: 03+1] devices: refactor signature [software/homer] - 10https://gerrit.wikimedia.org/r/543889 (owner: 10Volans) [10:00:33] (03PS1) 10Filippo Giunchedi: hieradata: fix wikimedia_clusters for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/544166 (https://phabricator.wikimedia.org/T234232) [10:01:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: fix wikimedia_clusters for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/544166 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [10:02:17] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix wikimedia_clusters for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/544166 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [10:03:59] (03PS2) 10Giuseppe Lavagetto: echostore: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/541276 (https://phabricator.wikimedia.org/T234464) [10:04:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] echostore: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/541276 (https://phabricator.wikimedia.org/T234464) (owner: 10Giuseppe Lavagetto) [10:15:32] (03PS1) 10Jbond: puppet-merge: implement locking. 
[puppet] - 10https://gerrit.wikimedia.org/r/544169 [10:18:06] jbond42: that rm -rf scares me :) [10:18:33] I'll remove the -r? [10:18:40] <3 [10:19:18] (03PS2) 10Jbond: puppet-merge: implement locking. [puppet] - 10https://gerrit.wikimedia.org/r/544169 [10:19:52] done :) [10:25:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] puppet-merge: implement locking. [puppet] - 10https://gerrit.wikimedia.org/r/544169 (owner: 10Jbond) [10:26:16] (03CR) 10Jbond: [C: 03+2] puppet-merge: implement locking. [puppet] - 10https://gerrit.wikimedia.org/r/544169 (owner: 10Jbond) [10:26:50] <_joe_> vgutierrez: lol I was about to write the review saying that, and saw the patch was updated [10:27:00] :) [10:27:01] <_joe_> and I thought jbond42 was reading my mind [10:27:19] ok, I have merged the current change and will get a refactor up for review later today [10:27:20] <_joe_> given he created the fix when I asked about it [10:27:28] <_joe_> jbond42: thanks a lot [10:27:33] np [10:29:20] (03CR) 10Ayounsi: [C: 04-1] "1 bug." 
(031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/543889 (owner: 10Volans) [10:30:31] (03PS1) 10Arturo Borrero Gonzalez: toolforge: reintroduce profile::toolforge::k8s::client [puppet] - 10https://gerrit.wikimedia.org/r/544170 [10:33:13] (03CR) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (owner: 10Giuseppe Lavagetto) [10:34:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: reintroduce profile::toolforge::k8s::client [puppet] - 10https://gerrit.wikimedia.org/r/544170 (owner: 10Arturo Borrero Gonzalez) [10:49:11] !log Uploading wikidiff2_1.9.0-2~wmf1 to wikimedia-stretch - T231586 [10:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:15] T231586: Deploy Updated wikidiff2 C++ Engine - https://phabricator.wikimedia.org/T231586 [10:50:03] stretch-wikimedia [10:50:04] oh well [10:53:36] !log Updating wikidiff2 to 1.9.0-2~wmf1 and slowly restart php-fpm across the fleet [10:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:51] !log Updating wikidiff2 to 1.9.0-2~wmf1 and slowly restart php-fpm across the fleet - T234175 [10:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:55] T234175: Deploy wikidiff2 v1.9.0 - https://phabricator.wikimedia.org/T234175 [11:10:29] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4028.ulsfo.wmnet,service=ats-be [11:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:21] (03PS1) 10Arturo Borrero Gonzalez: wikimedia.cloud: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/544175 (https://phabricator.wikimedia.org/T235846) [11:56:14] !log `mwscript refreshLinks.php banwiki` on mwmaint1002 T235843 [11:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:19] T235843: run refreshLinks.php then 
updateArticleCount.php on banwiki - https://phabricator.wikimedia.org/T235843 [11:58:20] !log installing sudo security updates for jessie [11:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:54] (03PS2) 10Jbond: rake_module: update spec_helper to use nuyaml3 [puppet] - 10https://gerrit.wikimedia.org/r/543279 [12:05:44] (03CR) 10Ayounsi: "This hangs with the following." [software/homer] - 10https://gerrit.wikimedia.org/r/543890 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:06:17] (03CR) 10Jbond: [C: 03+2] rake_module: update spec_helper to use nuyaml3 [puppet] - 10https://gerrit.wikimedia.org/r/543279 (owner: 10Jbond) [12:10:10] !log !log disable puppet on puppetmasters to fix puppet-merge [12:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:24] (03PS1) 10Jbond: puppet-merge: fix typo in LABS_PRIVATE variable [puppet] - 10https://gerrit.wikimedia.org/r/544178 [12:14:44] (03CR) 10Jbond: [C: 03+2] puppet-merge: fix typo in LABS_PRIVATE variable [puppet] - 10https://gerrit.wikimedia.org/r/544178 (owner: 10Jbond) [12:20:08] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [12:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:46] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [12:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:07] (03PS1) 10Jbond: icinga: update cirrus CI checks to use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/544179 [12:23:32] (03CR) 10Jbond: [C: 03+2] icinga: update cirrus CI checks to use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/544179 (owner: 10Jbond) [12:24:09] (03PS16) 10Jbond: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [12:24:54] (03PS1) 10Muehlenhoff: Remove 
remaining eeden puppet references [puppet] - 10https://gerrit.wikimedia.org/r/544180 (https://phabricator.wikimedia.org/T235770) [12:24:58] (03PS1) 10Ema: cache: reimage cp4029 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/544181 (https://phabricator.wikimedia.org/T227432) [12:26:23] (03PS1) 10Muehlenhoff: Remove DNS entries for eeden [dns] - 10https://gerrit.wikimedia.org/r/544182 (https://phabricator.wikimedia.org/T235770) [12:26:30] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [12:31:12] (03CR) 10Ayounsi: "To follow up." [software/homer] - 10https://gerrit.wikimedia.org/r/543890 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:32:37] (03PS2) 10Muehlenhoff: Remove remaining eeden puppet references [puppet] - 10https://gerrit.wikimedia.org/r/544180 (https://phabricator.wikimedia.org/T235770) [12:34:16] (03CR) 10Muehlenhoff: [C: 03+2] Remove remaining eeden puppet references [puppet] - 10https://gerrit.wikimedia.org/r/544180 (https://phabricator.wikimedia.org/T235770) (owner: 10Muehlenhoff) [12:34:58] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for eeden [dns] - 10https://gerrit.wikimedia.org/r/544182 (https://phabricator.wikimedia.org/T235770) (owner: 10Muehlenhoff) [12:39:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1076 after schema change', diff saved to https://phabricator.wikimedia.org/P9389 and previous config saved to /var/cache/conftool/dbconfig/20191018-123930-marostegui.json [12:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:02] (03PS17) 10Jbond: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:01:50] !log Compress db2084:3315 T235599 [13:01:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:54] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 [13:02:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2084:3315 for tables compression T235599', diff saved to https://phabricator.wikimedia.org/P9390 and previous config saved to /var/cache/conftool/dbconfig/20191018-130253-marostegui.json [13:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:09] (03PS1) 10Marostegui: instances.yaml: Remove db2059 [puppet] - 10https://gerrit.wikimedia.org/r/544190 (https://phabricator.wikimedia.org/T230884) [13:06:00] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2059 [puppet] - 10https://gerrit.wikimedia.org/r/544190 (https://phabricator.wikimedia.org/T230884) (owner: 10Marostegui) [13:07:46] (03CR) 10Ottomata: "32 is a lot! Should be fine in theory, but this means that unless your 32 partition topic has a really high throughput, each partition wi" [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [13:08:53] (03PS1) 10Arturo Borrero Gonzalez: toolforge: rename k8s::apilb role/profile to k8s::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/544191 (https://phabricator.wikimedia.org/T234037) [13:08:58] (03CR) 10Ottomata: "Also, this will only change the partition count for new topics going forward. Since often, most new topics are low volume, we keep a low " [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [13:09:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I just realized you also need the entries in conftool-data for discovery." 
[puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [13:10:24] (03PS2) 10Arturo Borrero Gonzalez: toolforge: rename k8s::apilb role/profile to k8s::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/544191 (https://phabricator.wikimedia.org/T234037) [13:13:30] (03CR) 10Ottomata: "Hm. Don't deb packages have a way of choosing the whether to overwrite installed config files? I have some vague memory of seeing things" [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/544156 (owner: 10Elukey) [13:22:01] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:22:27] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:23:13] (03CR) 10Elukey: "> Hm. 
Don't deb packages have a way of choosing the whether to" [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/544156 (owner: 10Elukey) [13:28:49] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [13:29:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2059 from config, host decommissioned', diff saved to https://phabricator.wikimedia.org/P9391 and previous config saved to /var/cache/conftool/dbconfig/20191018-132934-marostegui.json [13:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:57] (03PS1) 10Reedy: Remove testwiki => true from wmgUseCentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544193 [13:31:32] (03CR) 10jerkins-bot: [V: 04-1] Remove testwiki => true from wmgUseCentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544193 (owner: 10Reedy) [13:32:12] (03CR) 10Ottomata: "Right, but doesn't this patch disable installing it?" [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/544156 (owner: 10Elukey) [13:33:09] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:33:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:37:51] (03CR) 10Ottomata: "It might be good to try and target a throughput rate per partition (and per consumer). 
At what point does a logstash consumer start laggi" [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [13:40:29] (03CR) 10Ottomata: "Oh, sorry I read your comment as saying the topic currently had only one partition, you are saying that a logstash consumer can handle aro" [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [13:42:33] PROBLEM - Host db1105 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:08] what? [13:43:11] RECOVERY - Host db1105 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [13:43:26] it got rebooted [13:43:26] nice [13:44:13] (03PS2) 10Filippo Giunchedi: hieradata: bump kafka-logging default partitions [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) [13:45:09] nothing in "racadm getsel" for db1105 (or kern.log) [13:45:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311 and db1105:3312 host rebooted itself', diff saved to https://phabricator.wikimedia.org/P9392 and previous config saved to /var/cache/conftool/dbconfig/20191018-134517-marostegui.json [13:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:07] is that a production host? [13:46:12] yep [13:46:14] it is depooled now [13:46:16] I am investigating [13:46:19] I think it is a storage crash [13:46:43] let me see if I arrive on time to disable notifications [13:46:48] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [13:47:38] ottomata: moved to 12 partitions, thanks for your help! 
after merge I'll change bump partitions for existing topics too [13:48:10] !log disabled notifications on db1105 [13:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:46] https://phabricator.wikimedia.org/T235877 [13:49:58] is logs bot down? [13:51:50] logmsgbot: help [13:51:56] :( [13:52:22] (03PS3) 10Eevans: [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [13:53:05] I am not going to pool this host back on a friday evening [13:53:10] let's leave it depooled for the weekend [13:53:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [13:53:27] jynus: thanks for disabling notifications - I am going to disable them on puppet for consistency [13:53:41] (03PS1) 10Herron: admin: add gsingers to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) [13:54:32] (03PS1) 10Marostegui: db1105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/544198 (https://phabricator.wikimedia.org/T235877) [13:55:15] ok, I will enable them when you are done [13:55:33] (03CR) 10Marostegui: [C: 03+2] db1105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/544198 (https://phabricator.wikimedia.org/T235877) (owner: 10Marostegui) [13:55:40] oh, I think you did that already? [13:55:57] just merged, leave them disabled [13:56:00] no, wrong host [13:56:18] wrong host? 
[13:56:18] I prefer to disable them manually, because later icinga ends up in a weird state [13:56:36] and they are all green right now [13:56:42] because it is all recovered [13:56:55] I am not going to pool it back [13:57:11] yes, I just want them to be disabled from your patch [13:57:17] I am going to run a data check and leave it running for the weekend, just in case it is a recurring issue [13:57:25] My patch is merged [13:57:41] otherwise it could do strange stuff due to double disabling, or at least it should [13:57:52] plus we will not forget to enable them when you revert [13:57:57] you got me now? [13:58:05] jynus: I have stopped puppet on db1105, so you can enable them on the UI [13:58:13] I just did [13:58:18] you can run puppet [13:58:19] ah cool, can I run puppet then? [13:58:21] ok cool! [13:58:37] it will only take effect after they run also on icinga [13:58:41] yep [13:58:43] I know [13:58:48] I know you know :-) [13:59:04] I am going to clean the HW logs, they are not useful at all, they are just 1y old [13:59:20] is the load on s1 ok? 
[13:59:36] yeah [13:59:36] those were the hosts that had load issues recently, I think [13:59:42] db1099:3311 is fine serving rc [13:59:43] the contributor ones [14:00:18] only when we got that amount of queries [14:00:22] we spoke about a few days ago [14:00:40] no more ongoing log errors [14:03:07] (03PS18) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [14:03:09] (03PS25) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [14:03:11] (03PS22) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [14:03:13] (03PS17) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [14:03:15] (03PS17) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [14:03:17] (03PS17) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [14:07:10] jynus, marostegui: db1105 is from the batch of servers which needs a firmware upgrade to address the reboot stalls (https://phabricator.wikimedia.org/T216240), might be a good time to piggyback that now that it's depooled [14:07:42] (03CR) 10jerkins-bot: [V: 04-1] rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [14:09:00] moritzm: yeah, normally after a crash we ask DCOPs to upgrade everything in case it happens again and we have to open a case [14:09:39] (03CR) 10Ottomata: [C: 
03+1] "Oh, I see you say that in the commit message duh." [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/544156 (owner: 10Elukey) [14:09:52] (03CR) 10Andrew Bogott: "> I am confused, why is this change needed at all if everything is already working fine? https://phabricator.wikimedia.org/T233236#5585066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [14:09:53] ack [14:10:06] !log Run compare.py on db1105 - T235877 [14:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:11] T235877: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 [14:17:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) (owner: 10Herron) [14:19:42] <_joe_> !log uploading cassandra 3.11.4 to stretch-wikimedia [14:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:42] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/7; [edit interfaces interface-ra... [14:28:13] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884 (10Papaul) [14:28:43] 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10serviceops: Investigate recurrent latency spikes for the MediaWiki appservers - https://phabricator.wikimedia.org/T235872 (10elukey) Custom dashboard to get some metrics about the MW WANObjectCache (derived from the Performance Team's dash... 
[14:29:34] 10Operations, 10DBA: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 (10Marostegui) There is no trace of errors on HW logs as seen at T235877#5586910, our experience with this kind of crashes/reboots relates to storage crashes that crash the whole server. I have seen some logs related to sma... [14:30:22] Krinkle,AaronSchulz: o/ - I wanted to duplicate the WANObjectCache dashboard with my own version in grafana, but I mistakenly ended up modifying yours. Really sorry, I have restored it to its original state, so if you see me in the history you know why :( [14:30:36] will be more careful next time! [14:31:19] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 (10Marostegui) Let's upgrade its firmware and BIOS to make sure it is all up-to-date in case this happens again and we need to open a case with the vendor. @Cmjohnson @Jclark-ctr can we arrange a... [14:41:33] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add OSPF support [homer/public] - 10https://gerrit.wikimedia.org/r/543795 (owner: 10Ayounsi) [14:41:42] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add security zones for MRs [homer/public] - 10https://gerrit.wikimedia.org/r/543840 (owner: 10Ayounsi) [14:41:49] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add LLDP and IGMP snooping support [homer/public] - 10https://gerrit.wikimedia.org/r/543870 (owner: 10Ayounsi) [14:42:05] (03CR) 10Filippo Giunchedi: "One note: the index template isn't automatically re-uploaded by logstash, so the change isn't effective yet. I'll update the template on M" [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [14:45:49] 10Operations, 10Acme-chief, 10Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (10MoritzMuehlenhoff) @Vgutierrez I created a 2.6.1-3+deb10u2, it's in my home on acmechief1001. 
Let's deploy this on acmechief* hosts on Monday before I submit this to the Debian stable release... [14:46:52] 10Operations, 10DBA, 10serviceops: Backups on buster hosts fail to run - https://phabricator.wikimedia.org/T235838 (10akosiaris) I see 2 paths forward: * Push forward with the migration in order to fix this * Backport bacula 9 from the stretch backport to jessie and do the upgrade in place on the old hosts... [14:55:52] (03PS1) 10Herron: logstash: set template_overwrite true in elasticsearch outputs [puppet] - 10https://gerrit.wikimedia.org/r/544209 [14:56:50] (03PS2) 10Herron: logstash: set template_overwrite true in elasticsearch outputs [puppet] - 10https://gerrit.wikimedia.org/r/544209 (https://phabricator.wikimedia.org/T234564) [14:58:20] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (10Papaul) [14:58:50] (03CR) 10Herron: "this follows bd72f35bef2a9b65d275ea52c63c2a227054f5ed to help apply logstash es template changes automatically" [puppet] - 10https://gerrit.wikimedia.org/r/544209 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [15:06:54] !log Reattach DannyS712@banwiki to DannyS712@SUL (T235446) [15:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:09] T235446: Unable to log into ban.wiki - https://phabricator.wikimedia.org/T235446 [15:13:14] 10Operations, 10netops: IRR updates needed - https://phabricator.wikimedia.org/T235886 (10ayounsi) p:05Triage→03Low [15:30:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/544209 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [15:33:41] jouncebot: now [15:33:41] No deployments scheduled for the next 66 hour(s) and 56 minute(s) [15:33:48] ah, it's Friday :/ [15:33:52] * Urbanecm stagging at mwdebug1001 [15:34:59] (03PS3) 10Andrew Bogott: labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/543943 [15:35:45] (03CR) 10jerkins-bot: [V: 04-1] labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (owner: 10Andrew Bogott) [15:38:18] (03PS15) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [15:38:20] (03PS4) 10Andrew Bogott: labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) [15:38:24] !log Rename DannyS712@banwiki to DannyS712 (T235446) locally (T235446) [15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:29] T235446: Unable to log into ban.wiki - https://phabricator.wikimedia.org/T235446 [15:38:44] !log Run extensions/CentralAuth/maintenance/createLocalAccount.php --wiki=banwiki DannyS712 (T235446) [15:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:05] (03CR) 10jerkins-bot: [V: 04-1] labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott) [15:39:12] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [15:39:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cassandra: Pin Cassandra packages to version 3.11.4 [puppet] - 10https://gerrit.wikimedia.org/r/543494 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans) [15:40:16] !log Reassign edits from DannyS712 (T235446) to DannyS712 at banwiki (T235446) [15:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:07] (03CR) 10Mobrovac: [C: 04-1] "Needs a reference in 
tests/TestServices.php changed too, otherwise LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [15:42:14] (03PS7) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [15:42:48] (03CR) 10Cwhite: "Marking inline comments as complete." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [15:49:07] (03PS2) 10Eevans: rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) [15:49:24] (03PS1) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [15:49:47] (03CR) 10jerkins-bot: [V: 04-1] rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [15:53:34] (03CR) 10Eevans: [C: 03+1] cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans) [15:56:44] (03CR) 10Urbanecm: [C: 04-1] Initial configuration for mnwwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [15:57:09] 10Operations, 10Wikimedia-Logstash: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi) [15:57:58] (03CR) 10Urbanecm: Initial configuration for mnwwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [15:58:28] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [15:59:30] (03PS2) 10Jbond: puppet-merge: refactor 
[puppet] - 10https://gerrit.wikimedia.org/r/544214 [15:59:39] (03PS1) 10Filippo Giunchedi: logstash: drop quotes from filter config options [puppet] - 10https://gerrit.wikimedia.org/r/544217 (https://phabricator.wikimedia.org/T235891) [16:01:44] (03PS3) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [16:03:50] (03PS1) 10Filippo Giunchedi: logstash: config readable by logstash only by default [puppet] - 10https://gerrit.wikimedia.org/r/544218 (https://phabricator.wikimedia.org/T235891) [16:11:41] PROBLEM - SSH on boron is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:14:39] RECOVERY - SSH on boron is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:16:34] (03PS1) 10Jcrespo: [WIP] bacula:First version of bacup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [16:17:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP] bacula:First version of bacup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [16:17:40] (03PS2) 10Jcrespo: [WIP]bacula:Add 1st version of backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [16:18:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP]bacula:Add 1st version of backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [16:21:16] (03CR) 1020after4: [C: 03+1] Phabricator: Uninstall Conpherence application also in default settings [puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: 10Aklapper) [16:22:39] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - 
https://phabricator.wikimedia.org/T234900 (10jcrespo) With this first version I get: ` root@helium:~$ python3 check_bacula.py Jobs with fresh backups: 76 Jobs with stale full backups only: 1 Jobs... [16:23:57] (03PS8) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [16:25:12] (03PS1) 10CRusnov: netbox: Add parameters to communicate http scheme to django [puppet] - 10https://gerrit.wikimedia.org/r/544222 [16:26:23] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:27:42] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Dwisehaupt) 05Open→03Resolved @Papaul Yes, it is. Thanks. Closing it up. [16:29:56] (03PS6) 10Cwhite: profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [16:32:54] (03PS9) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [16:33:23] (03PS1) 10Arturo Borrero Gonzalez: openstack: wmcs-makedomain: allow transfers on domains owned by admin project [puppet] - 10https://gerrit.wikimedia.org/r/544223 (https://phabricator.wikimedia.org/T235846) [16:35:31] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:51:11] (03PS10) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [16:51:55] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add swagger 
exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:54:19] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 (10wiki_willy) a:03Cmjohnson [16:56:44] (03CR) 10CDanis: "Thanks for doing this! Mostly LGTM, have some nits." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [16:57:21] (03PS7) 10Cwhite: profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [16:57:23] (03PS11) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [16:59:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, and 2 others: Analytics Access for Grant (groups cn=wmf and analytics-privatedata-users) - https://phabricator.wikimedia.org/T235260 (10Nuria) Please @gsingers test whether you can access https://turnilo.wikimedia.org and https://super... 
[16:59:59] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:01:56] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) And also we need to add lex to nda group for access to turnilo and superset [17:04:29] (03PS12) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:04:39] (03PS4) 10Dzahn: discovery.yaml: add parsoid-php microservice [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) [17:15:30] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) @RStallman-legalteam Hi, do you have NDA on file for Lex Nasser? [17:15:34] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) @Nuria Thank you for the groups, will prepare a change and move forward with this asap. @lexnasser Thank you, yes, i see the signature looks good! [17:15:44] (03CR) 10Dzahn: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [17:17:20] (03CR) 10Ayounsi: [C: 03+1] "Talked on irc, lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/544222 (owner: 10CRusnov) [17:17:32] (03CR) 10EBernhardson: [C: 03+1] "It's not clear from the docs if this actually monitors the file for updates, or at what point the file on disk is checked. 
Overall though " [puppet] - 10https://gerrit.wikimedia.org/r/544209 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [17:17:34] (03CR) 10CRusnov: [C: 03+2] netbox: Add parameters to communicate http scheme to django [puppet] - 10https://gerrit.wikimedia.org/r/544222 (owner: 10CRusnov) [17:26:32] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) @Nuria Is Lex going to get a @wikimedia.org email address? I was wondering because i need to specify one in the code change and it looks like it doesn't exist un... [17:28:46] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10Dzahn) a:03RStallman-legalteam [17:29:46] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Dzahn) a:03Nuria [17:30:08] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Dzahn) a:03RStallman-legalteam [17:31:28] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) [17:32:20] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10ssastry) [17:33:03] 10Operations, 10serviceops: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (10ssastry) [17:33:24] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10RStallman-legalteam) @Dzahn I do not have an NDA on file for Lex Nasser, but it is possible that the paperwork for Lex's internship was completed through HR. @Nuria can... 
[17:33:39] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) [17:44:42] (03PS1) 10Dzahn: parsoid: turn all wtp servers into Parsoid/PHP-MW-appservers [puppet] - 10https://gerrit.wikimedia.org/r/544232 (https://phabricator.wikimedia.org/T233654) [17:47:26] (03PS1) 10Mholloway: Add MachineVision tables/columns to filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/544233 (https://phabricator.wikimedia.org/T235887) [17:48:40] 10Operations, 10DC-Ops, 10decommission: decommission eeden - https://phabricator.wikimedia.org/T235770 (10RobH) a:05RobH→03Jclark-ctr [17:49:09] (03CR) 10Mholloway: "This may be premature since the extension is not yet enabled in production, but I had some time waiting for Vagrant to provision and thoug" [puppet] - 10https://gerrit.wikimedia.org/r/544233 (https://phabricator.wikimedia.org/T235887) (owner: 10Mholloway) [17:54:49] 10Operations, 10DC-Ops, 10decommission: decommission eeden - https://phabricator.wikimedia.org/T235770 (10RobH) a:05Jclark-ctr→03Papaul [17:58:11] (03PS2) 10CDanis: Reduce arclamp .log file retention from 90 to 45 days [puppet] - 10https://gerrit.wikimedia.org/r/543931 (https://phabricator.wikimedia.org/T235455) (owner: 10Aaron Schulz) [17:59:35] (03CR) 10CDanis: [C: 03+2] Reduce arclamp .log file retention from 90 to 45 days [puppet] - 10https://gerrit.wikimedia.org/r/543931 (https://phabricator.wikimedia.org/T235455) (owner: 10Aaron Schulz) [18:01:29] (03CR) 10Ayounsi: "I found the fix and Cas implemented it with https://gerrit.wikimedia.org/r/c/operations/puppet/+/544222" [software/homer] - 10https://gerrit.wikimedia.org/r/543890 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:03:24] (03CR) 10Dzahn: [C: 04-1] "needs "has_lvs" as well. 
https://puppet-compiler.wmflabs.org/compiler1002/18931/wtp1026.eqiad.wmnet/change.wtp1026.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/544232 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [18:03:58] (03CR) 10Ayounsi: [C: 03+1] "Tested with various types of devices and it works fine now." [software/homer] - 10https://gerrit.wikimedia.org/r/543890 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:08:39] (03PS2) 10Dzahn: parsoid: turn all wtp servers into Parsoid/PHP-MW-appservers [puppet] - 10https://gerrit.wikimedia.org/r/544232 (https://phabricator.wikimedia.org/T233654) [18:09:38] (03PS3) 10Dzahn: parsoid: turn all wtp servers into Parsoid/PHP-MW-appservers [puppet] - 10https://gerrit.wikimedia.org/r/544232 (https://phabricator.wikimedia.org/T233654) [18:14:19] 10Operations, 10ops-ulsfo, 10Traffic: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (10RobH) p:05Triage→03Normal [18:15:15] ACKNOWLEDGEMENT - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T235911 [18:15:15] ACKNOWLEDGEMENT - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T235911 [18:18:13] (03CR) 10Andrew Bogott: [C: 03+1] "If it works, it works :)" [puppet] - 10https://gerrit.wikimedia.org/r/544223 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [18:19:25] (03CR) 10CDanis: [C: 03+1] "> Patch Set 14:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [18:19:28] !log temp. disabling puppet on all wtp* servers [18:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:17] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18933/" [puppet] - 10https://gerrit.wikimedia.org/r/544232 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [18:27:34] !log temp. 
disabled puppet on all wtp* servers, adding mediawiki appserver roles on them incrementally by re-enabling puppet, starting with wtp1026, scheduled icinga downtime for wtp* all services (T233654) [18:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:39] T233654: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 [18:39:31] (03CR) 10Dzahn: "@marostegui i can already connect from the new host without this though. application is using m1-master.eqiad.wmnet and dbname and user ar" [puppet] - 10https://gerrit.wikimedia.org/r/544079 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [18:41:39] (03CR) 10Alex Monk: "Let's please not put anything in the admin project that doesn't strictly need to be there - IIRC it has some special meaning/purpose, nova" [puppet] - 10https://gerrit.wikimedia.org/r/544223 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [18:43:24] (03PS2) 10Dzahn: mariadb/ferm_misc: allow moscovium to connect to rt database [puppet] - 10https://gerrit.wikimedia.org/r/544079 (https://phabricator.wikimedia.org/T180641) [18:47:43] ACKNOWLEDGEMENT - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - Certificate *.wmflabs.org valid until 2019-11-16 15:41:05 +0000 (expires in 28 days) daniel_zahn https://phabricator.wikimedia.org/T233225 https://phabricator.wikimedia.org/tag/toolforge/ [18:56:03] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Please add @jrobell and @spatton to WMF-NDA - https://phabricator.wikimedia.org/T161822 (10Dzahn) [18:56:50] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Please add @jrobell and @spatton to WMF-NDA - https://phabricator.wikimedia.org/T161822 (10Dzahn) a:03spatton [18:57:39] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Please add @jrobell and @spatton to WMF-NDA - https://phabricator.wikimedia.org/T161822 (10spatton) Thanks @Dzahn! 
I sent a request to Lisa and hope to have her approval soon. I'll stay on top of this :) [18:58:31] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Please add @jrobell and @spatton to WMF-NDA (access to private Phabricator tasks) - https://phabricator.wikimedia.org/T161822 (10Dzahn) [18:59:12] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Please add @jrobell and @spatton to WMF-NDA (access to private Phabricator tasks) - https://phabricator.wikimedia.org/T161822 (10Dzahn) @spatton Great, and i added some tags to make it visible. Thanks. [19:15:28] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544193 (owner: 10Reedy) [19:17:35] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [19:18:10] (03Abandoned) 10Jforrester: [DNM] Debug for T235816 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544106 (owner: 10DannyS712) [19:18:17] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544114 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [19:18:21] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:18:24] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott) [19:18:26] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [19:18:39] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [19:19:17] 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - 
https://phabricator.wikimedia.org/T228676 (10Ladsgroup) p:05Normal→03High [19:45:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1026.eqiad.wmnet,service=parsoid-php [19:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1027.eqiad.wmnet,service=parsoid-php [19:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:43] jouncebot: next [19:46:43] In 62 hour(s) and 43 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1030) [19:47:12] plenty of time [19:47:26] haha, indeed [19:49:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2002.codfw.wmnet,service=parsoid-php [19:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:34] (03CR) 10Hashar: [C: 03+1] "I have no idea what it affects, but it seems the doc says so :]" [puppet] - 10https://gerrit.wikimedia.org/r/544209 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [19:55:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1028.eqiad.wmnet,service=parsoid-php [19:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:05] (03CR) 10Dzahn: [C: 03+1] "we are not receiving mails anymore anyways.." 
[puppet] - 10https://gerrit.wikimedia.org/r/544078 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [20:10:44] (03PS3) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) [20:13:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2003.codfw.wmnet,service=parsoid-php [20:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1029.eqiad.wmnet,service=parsoid-php [20:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1030.eqiad.wmnet,service=parsoid-php [20:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:28] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp2008 is CRITICAL: NRPE: Command check_check_php7.2-fpm_check_restart_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:23:30] PROBLEM - php7.2-fpm service on wtp2009 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:24:42] RECOVERY - php7.2-fpm service on wtp2009 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:24:58] ACKNOWLEDGEMENT - Apache HTTP on wtp2009 is CRITICAL: connect to address 10.192.16.51 and port 80: Connection refused daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Application_servers [20:25:19] almost got them all before this happened.. 
almost [20:26:24] keeps reloading the "pending" page in Icinga to downtime them [20:27:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1031.eqiad.wmnet,service=parsoid-php [20:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:42] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp2008 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:37:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1027.eqiad.wmnet,service=parsoid-php [20:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1031.eqiad.wmnet,service=parsoid-php [20:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:22] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2004.codfw.wmnet,service=parsoid-php [20:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2005.codfw.wmnet,service=parsoid-php [20:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:41] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) @RStallman-legalteam it was done through HR, yes, he probably needs to sign an NDA as well? 
[20:49:38] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp2016 is CRITICAL: NRPE: Command check_check_php7.2-fpm_check_restart_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:50:42] PROBLEM - mediawiki-installation DSH group on wtp2013 is CRITICAL: Host wtp2013 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:50:42] PROBLEM - php7.2-fpm service on wtp2016 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:50:42] PROBLEM - php7.2-fpm service on wtp2012 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:51:35] ACKNOWLEDGEMENT - php7.2-fpm service on wtp2012 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined daniel_zahn . https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:51:35] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp2013 is CRITICAL: Host wtp2013 is not in mediawiki-installation dsh group daniel_zahn . https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:51:35] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on wtp2016 is CRITICAL: NRPE: Command check_check_php7.2-fpm_check_restart_status not defined daniel_zahn . https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:51:35] ACKNOWLEDGEMENT - php7.2-fpm service on wtp2016 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined daniel_zahn . 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:52:54] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1032.eqiad.wmnet,service=parsoid-php [20:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:08] RECOVERY - php7.2-fpm service on wtp2012 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:55:51] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10RStallman-legalteam) Ok, probably best for me to just create one since it looks like shell access is needed. @lexnasser could you email your physical (snail mail) addre... [21:00:20] 10Operations, 10Cassandra, 10Core Platform Team Legacy (Later), 10User-Eevans: Upload 3.11.4 packages to APT repo - https://phabricator.wikimedia.org/T235675 (10Eevans) a:05Eevans→03Joe I believe this is complete. [21:05:00] (03PS4) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) [21:06:22] ACKNOWLEDGEMENT - SSH mw1290.mgmt on mw1290.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T234153 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:31] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10lexnasser) @RStallman-legalteam Just sent an email [21:08:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1033.eqiad.wmnet,service=parsoid-php [21:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:23] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10RStallman-legalteam) NDA is 
complete and on file. Thanks! [21:12:59] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Dzahn) a:05RStallman-legalteam→03Dzahn [21:14:49] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp2017 is CRITICAL: NRPE: Command check_check_php7.2-fpm_check_restart_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:19:59] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp2016 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:23:31] RECOVERY - php7.2-fpm service on wtp2016 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:29:48] (03PS5) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) [21:34:00] (03PS6) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) [21:35:25] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp2017 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:41:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1034.eqiad.wmnet,service=parsoid-php [21:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1035.eqiad.wmnet,service=parsoid-php [21:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1036.eqiad.wmnet,service=parsoid-php [21:42:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1039.eqiad.wmnet,service=parsoid-php [21:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:23] PROBLEM - php7.2-fpm service on wtp1041 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:50:17] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp1043 is CRITICAL: NRPE: Command check_check_php7.2-fpm_check_restart_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:50:17] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp1044 is CRITICAL: NRPE: Command check_check_php7.2-fpm_check_restart_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:50:53] PROBLEM - Nginx local proxy to apache on wtp1041 is CRITICAL: connect to address 10.64.32.233 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [21:50:55] PROBLEM - PHP opcache health on wtp1038 is CRITICAL: NRPE: Command check_opcache not defined https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:50:55] PROBLEM - PHP opcache health on wtp1040 is CRITICAL: NRPE: Command check_opcache not defined https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:52:47] PROBLEM - PHP7 rendering on wtp1038 is CRITICAL: connect to address 10.64.32.230 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:52:47] PROBLEM - PHP7 rendering on wtp1040 is CRITICAL: connect to address 10.64.32.232 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:52:53] PROBLEM - PHP opcache health on wtp1041 is CRITICAL: NRPE: Command 
check_opcache not defined https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:53:01] (03PS2) 10Ammarpad: Add custom Minerva wordmark for Hebrew wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) [21:55:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1037.eqiad.wmnet,service=parsoid-php [21:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:05] RECOVERY - php7.2-fpm service on wtp1041 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:57:52] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2006.codfw.wmnet,service=parsoid-php [21:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2007.codfw.wmnet,service=parsoid-php [21:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2008.codfw.wmnet,service=parsoid-php [21:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:21] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2009.codfw.wmnet,service=parsoid-php [21:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2010.codfw.wmnet,service=parsoid-php [21:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:15] RECOVERY - PHP7 rendering on wtp1040 is OK: HTTP OK: HTTP/1.1 200 OK - 81559 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:01:35] RECOVERY - PHP opcache health on wtp1040 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:01:39] PROBLEM - Check systemd state on wtp1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:23] hrmm [22:03:33] yea, those are new too [22:06:36] PROBLEM - Check systemd state on wtp1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:44] RECOVERY - PHP7 rendering on wtp1038 is OK: HTTP OK: HTTP/1.1 200 OK - 81583 bytes in 0.729 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:07:10] RECOVERY - PHP opcache health on wtp1038 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:08:22] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1045.eqiad.wmnet,service=parsoid-php [22:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2011.codfw.wmnet,service=parsoid-php [22:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:12] PROBLEM - Check systemd state on wtp1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2012.codfw.wmnet,service=parsoid-php [22:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:22] Operations, Growth-Team, Notifications, serviceops, and 2 others: Dashboards for monitoring of echostore - https://phabricator.wikimedia.org/T235558 (Eevans) This is now done: See https://logstash.wikimedia.org/app/kibana#/dashboard/AW3go_uCx3rdj6D8q-he & https://grafana.wikimedia.org/d/IfJykaTZk... [22:10:31] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2032.codfw.wmnet,service=parsoid-php [22:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:52] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp1043 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:10:52] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp1044 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:11:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2013.codfw.wmnet,service=parsoid-php [22:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2014.codfw.wmnet,service=parsoid-php [22:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2015.codfw.wmnet,service=parsoid-php [22:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2016.codfw.wmnet,service=parsoid-php 
[22:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2017.codfw.wmnet,service=parsoid-php [22:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2018.codfw.wmnet,service=parsoid-php [22:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2020.codfw.wmnet,service=parsoid-php [22:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:29] PROBLEM - Check systemd state on wtp1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:17] RECOVERY - PHP opcache health on wtp1041 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:19:19] ACKNOWLEDGEMENT - Check systemd state on wtp1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:33] RECOVERY - Nginx local proxy to apache on wtp1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:20:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1038.eqiad.wmnet,service=parsoid-php [22:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:51] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1040.eqiad.wmnet,service=parsoid-php [22:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2019.codfw.wmnet,service=parsoid-php [22:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2020.codfw.wmnet,service=parsoid-php [22:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:21] RECOVERY - mediawiki-installation DSH group on wtp2013 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:27:46] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1038.eqiad.wmnet,service=parsoid-php [22:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:45] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1041.eqiad.wmnet,service=parsoid-php [22:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:21] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.48 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:30:32] Operations, ops-codfw: (OoW) 
wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (Dzahn) from `dmesg | grep EDAC` ` [12639234.186693] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [12639234.186694] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8800004f00800092 [12639... [22:32:13] RECOVERY - Check systemd state on wtp1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:11] RECOVERY - Check systemd state on wtp1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:05] RECOVERY - Check systemd state on wtp1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1046.eqiad.wmnet,service=parsoid-php [22:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1043.eqiad.wmnet,service=parsoid-php [22:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:33] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1042.eqiad.wmnet,service=parsoid-php [22:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1044.eqiad.wmnet,service=parsoid-php [22:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:15] RECOVERY - Check systemd state on wtp1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1048.eqiad.wmnet,service=parsoid-php [22:48:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:45] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1047.eqiad.wmnet,service=parsoid-php [22:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:29] subbu: ^ all done and pooled for parsoid-php. monitoring checks got added and fully green now: [22:55:33] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=wtp [22:55:41] yay. thanks. [22:55:57] yw:) [22:57:26] i need to do scap config and pull [23:00:23] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 77.86 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:01:22] Operations, Wikimedia-Mailing-lists: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (crusnov) Apologies for not updating until now, I shall complete this on Monday if that is sufficient so as to not break things over a weekend. [23:04:23] nevermind, i dont have to. scap pull does nothing (besides checking if php-fpm restart is needed) because it's up2date [23:20:00] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) With [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/544232 | this change ]] the parameters `profile::parsoid::use_php` and `has_lvs` have become defau... 
[23:21:14] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) [23:30:52] Operations, serviceops: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (Dzahn) Linking my pending Gerrit changes here: [ ] [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/542572 | discovery.yaml: add parsoid-php microservice ]] [ ] [[ https://gerrit.wikimedia.org/r/c/ope... [23:31:03] (PS3) Dzahn: LVS: add config for parsoid-php service [puppet] - https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) [23:33:23] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) [23:34:16] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) p:Normal→High [23:34:27] Operations, serviceops, Patch-For-Review: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (Dzahn) p:Normal→High [23:36:42] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) @Joe If from here on the remaining steps are T233722 then this ticket should be resolved i think. [23:55:06] Operations, ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (Papaul) @Dzahn Thanks. [23:57:35] (PS1) Dzahn: admins: add Verena Lindner to ldap_only admins, NDA request [puppet] - https://gerrit.wikimedia.org/r/544275 (https://phabricator.wikimedia.org/T233807) [23:59:52] (CR) Dzahn: [C: +2] "NDA on file - https://phabricator.wikimedia.org/T233807#5586377" [puppet] - https://gerrit.wikimedia.org/r/544275 (https://phabricator.wikimedia.org/T233807) (owner: Dzahn)