[00:06:43] (PS2) Reedy: Add export-11 [mediawiki-config] - https://gerrit.wikimedia.org/r/584095 (https://phabricator.wikimedia.org/T238921)
[00:06:48] (CR) Reedy: [C: +2] Add export-11 [mediawiki-config] - https://gerrit.wikimedia.org/r/584095 (https://phabricator.wikimedia.org/T238921) (owner: Reedy)
[00:07:54] (Merged) jenkins-bot: Add export-11 [mediawiki-config] - https://gerrit.wikimedia.org/r/584095 (https://phabricator.wikimedia.org/T238921) (owner: Reedy)
[00:12:24] !log reedy@deploy1001 Synchronized docroot/mediawiki.org/xml/: Add export-0.11 (duration: 01m 05s)
[00:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:06] (PS2) Krinkle: jenkins: Change CSP header to allow inline CSS and video playback (take 2) [puppet] - https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) (owner: Hashar)
[00:28:56] Reedy: something something about HTTPS :)
[00:29:41] lol
[00:29:43] let's fix the lot
[00:30:30] though, the xmlns is http
[00:30:33] so we can't change everything
[00:34:48] (PS1) Reedy: Use https for links in xml folder [mediawiki-config] - https://gerrit.wikimedia.org/r/585056
[00:36:53] (PS1) Reedy: Replace link to sitelist.txt with sitelist.md [mediawiki-config] - https://gerrit.wikimedia.org/r/585057
[00:36:58] (CR) Reedy: [C: +2] Use https for links in xml folder [mediawiki-config] - https://gerrit.wikimedia.org/r/585056 (owner: Reedy)
[00:37:02] (CR) Reedy: [C: +2] Replace link to sitelist.txt with sitelist.md [mediawiki-config] - https://gerrit.wikimedia.org/r/585057 (owner: Reedy)
[00:37:54] (Merged) jenkins-bot: Use https for links in xml folder [mediawiki-config] - https://gerrit.wikimedia.org/r/585056 (owner: Reedy)
[00:37:57] (Merged) jenkins-bot: Replace link to sitelist.txt with sitelist.md [mediawiki-config] - https://gerrit.wikimedia.org/r/585057 (owner: Reedy)
[00:39:37] !log reedy@deploy1001 Synchronized docroot/mediawiki.org/xml/: Update http and prot rel links to https, fix link to sitelist in MW Core (duration: 01m 06s)
[00:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:24] ^ and before anyone complains that it's not updated, it's cached :P
[00:53:04] (PS1) Andrew Bogott: DO NOT MERGE -- experimental no-op tox test [puppet] - https://gerrit.wikimedia.org/r/585058
[00:57:37] (CR) jerkins-bot: [V: -1] DO NOT MERGE -- experimental no-op tox test [puppet] - https://gerrit.wikimedia.org/r/585058 (owner: Andrew Bogott)
[01:01:03] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[01:11:15] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[01:26:52] (PS1) Samwilson: Enable password-reset-update on all other than Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/585059 (https://phabricator.wikimedia.org/T245791)
[01:51:49] PROBLEM - PHP opcache health on mw2216 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:06:18] (CR) HMonroy: [C: +1] "Looks good to me!" [mediawiki-config] - https://gerrit.wikimedia.org/r/585059 (https://phabricator.wikimedia.org/T245791) (owner: Samwilson)
[02:08:37] RECOVERY - PHP opcache health on mw2216 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:44:18] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (Jpita) >>! In T247722#6014253, @jcrespo wrote: > This is the status right now: > ` > uid: josepita > cn: Jose pita > email: jpita-ctr@wikimedia.org > ldap groups: wmf...
[03:16:34] (CR) Scardenasmolinar: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/585059 (https://phabricator.wikimedia.org/T245791) (owner: Samwilson)
[03:32:21] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:36:33] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 63 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:42:23] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 37 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:48:25] PROBLEM - Disk space on backup1001 is CRITICAL: DISK CRITICAL - free space: /srv/databases 1456921 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup1001&var-datasource=eqiad+prometheus/ops
[03:58:33] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 109 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:58:45] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 103 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:59:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 107 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:04:49] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:29] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 43 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:16:15] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 36 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:22:37] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 37 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:27:35] PROBLEM - PHP opcache health on mw2244 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:40:43] RECOVERY - PHP opcache health on mw2244 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:07:25] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:35] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:37:43] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:38:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 for schema change', diff saved to https://phabricator.wikimedia.org/P10840 and previous config saved to /var/cache/conftool/dbconfig/20200401-053827-marostegui.json
[05:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:41] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:39:55] !log Deploy schema change on db1121 (this will create lag on s4 labs)
[05:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:28] (PS1) Marostegui: Revert "install_server: Allow db2093 reimage without formating /srv" [puppet] - https://gerrit.wikimedia.org/r/585069
[05:55:37] (PS2) Marostegui: Revert "install_server: Allow db2093 reimage without formating /srv" [puppet] - https://gerrit.wikimedia.org/r/585069
[05:57:13] (PS1) Vgutierrez: site: Reimage cp2038 as cache::upload [puppet] - https://gerrit.wikimedia.org/r/585070 (https://phabricator.wikimedia.org/T248816)
[05:57:35] (CR) Marostegui: [C: +2] Revert "install_server: Allow db2093 reimage without formating /srv" [puppet] - https://gerrit.wikimedia.org/r/585069 (owner: Marostegui)
[05:59:25] (CR) Vgutierrez: [C: +2] site: Reimage cp2038 as cache::upload [puppet] - https://gerrit.wikimedia.org/r/585070 (https://phabricator.wikimedia.org/T248816) (owner: Vgutierrez)
[06:00:51] the OSPF alerts are related to planned maintenance from telia
[06:01:00] all good :)
[06:01:18] (CR) Muehlenhoff: [C: +1] admin: Add holger to restricted group to run maintenance scripts [puppet] - https://gerrit.wikimedia.org/r/584932 (https://phabricator.wikimedia.org/T248922) (owner: Jcrespo)
[06:01:37] Operations, Traffic, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2038.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage...
[06:02:03] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[06:04:08] (CR) Elukey: kibana: move httpd proxy authentication to a separate profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles)
[06:05:14] (CR) Muehlenhoff: "If this is for Hadoop access, the user also needs Kerberos access: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a" [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[06:07:24] (CR) Muehlenhoff: "For Hive access, the users also need Kerberos access: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_fo" [puppet] - https://gerrit.wikimedia.org/r/584915 (https://phabricator.wikimedia.org/T248797) (owner: Jcrespo)
[06:10:59] (CR) Muehlenhoff: firewall: ensure abuse network blocks are placed first (1 comment) [puppet] - https://gerrit.wikimedia.org/r/584912 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[06:11:04] (CR) Elukey: [C: +1] "@gehel: let me know if this is something that I can do with David's help, it seems relatively easy to do (probably there will be some data" [puppet] - https://gerrit.wikimedia.org/r/577324 (https://phabricator.wikimedia.org/T246343) (owner: Gehel)
[06:18:43] (PS1) Elukey: profile::swap: add mysql credentials for analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/585071
[06:22:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[06:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:56] Operations, Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2038.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2038.codfw.wmnet'] `
[06:31:08] Operations, ops-codfw, Traffic, decommission: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (Vgutierrez)
[06:33:13] (CR) Elukey: [C: +2] profile::swap: add mysql credentials for analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/585071 (owner: Elukey)
[06:33:32] (PS1) Vgutierrez: site,install_server: Decommission cp2012 [puppet] - https://gerrit.wikimedia.org/r/585114 (https://phabricator.wikimedia.org/T249080)
[06:35:24] (CR) Vgutierrez: [C: +2] site,install_server: Decommission cp2012 [puppet] - https://gerrit.wikimedia.org/r/585114 (https://phabricator.wikimedia.org/T249080) (owner: Vgutierrez)
[06:35:37] (PS2) Vgutierrez: site,install_server: Decommission cp2012 [puppet] - https://gerrit.wikimedia.org/r/585114 (https://phabricator.wikimedia.org/T249080)
[06:36:09] !log depool & decommission cp2012 - T249080
[06:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:15] T249080: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080
[06:36:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[06:36:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission
[06:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:38:54] (PS1) Marostegui: tendril.pp: Set default basedir depending on the OS [puppet] - https://gerrit.wikimedia.org/r/585115 (https://phabricator.wikimedia.org/T248957)
[06:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:58] Operations, ops-codfw, Traffic, decommission, Patch-For-Review: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2012.codfw.wmnet` - cp2012.codfw.wmnet (**PASS**)...
[06:45:40] Operations: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 (MoritzMuehlenhoff)
[06:50:57] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,swagger_check_cxserver_cluster_codfw,swagger_check_mathoid_cluster_codfw} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:52:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:04:31] (CR) Gehel: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/577324 (https://phabricator.wikimedia.org/T246343) (owner: Gehel)
[07:06:37] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:59] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:02] !log pool cp2038 - T248816
[07:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:08] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816
[07:13:11] (PS2) Marostegui: control-mariadb-10.4: Increase package version [software] - https://gerrit.wikimedia.org/r/584958 (https://phabricator.wikimedia.org/T248957)
[07:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121 after schema change', diff saved to https://phabricator.wikimedia.org/P10841 and previous config saved to /var/cache/conftool/dbconfig/20200401-071339-marostegui.json
[07:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:55] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:19:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:21:54] (PS1) Vgutierrez: Remove cp2012 entries [dns] - https://gerrit.wikimedia.org/r/585122 (https://phabricator.wikimedia.org/T249009)
[07:22:29] (CR) Jcrespo: [C: -1] "Did you upload to the repo already? Removing TokuDB support was on purpose; we don't want it on 10.4/Buster." [software] - https://gerrit.wikimedia.org/r/584958 (https://phabricator.wikimedia.org/T248957) (owner: Marostegui)
[07:23:02] (CR) Marostegui: "> Did you upload to the repo already? Removing TokuDB support was on" [software] - https://gerrit.wikimedia.org/r/584958 (https://phabricator.wikimedia.org/T248957) (owner: Marostegui)
[07:28:46] (CR) Vgutierrez: [C: +2] Remove cp2012 entries [dns] - https://gerrit.wikimedia.org/r/585122 (https://phabricator.wikimedia.org/T249009) (owner: Vgutierrez)
[07:34:05] Operations, ops-codfw, Traffic, decommission: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (Vgutierrez) a:Vgutierrez→Papaul
[07:34:26] Operations, Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[07:39:26] (PS1) Vgutierrez: site: Reimage cp2039 as cache::text [puppet] - https://gerrit.wikimedia.org/r/585123 (https://phabricator.wikimedia.org/T248816)
[07:41:17] (CR) Vgutierrez: [C: +2] site: Reimage cp2039 as cache::text [puppet] - https://gerrit.wikimedia.org/r/585123 (https://phabricator.wikimedia.org/T248816) (owner: Vgutierrez)
[07:43:21] Operations, Traffic, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2039.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage...
[07:44:13] (Abandoned) Marostegui: control-mariadb-10.4: Increase package version [software] - https://gerrit.wikimedia.org/r/584958 (https://phabricator.wikimedia.org/T248957) (owner: Marostegui)
[07:57:01] RECOVERY - Disk space on backup1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup1001&var-datasource=eqiad+prometheus/ops
[07:58:50] Operations, ops-codfw, Traffic, decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (Vgutierrez)
[08:01:47] Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (Marostegui)
[08:04:04] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[08:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:08] Operations, Traffic, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2039.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2039.codfw.wmnet'] `
[08:09:29] !log Deploy schema change on db1138 (s4 primary master)
[08:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:21] Operations, Graphoid, Code-Stewardship-Reviews, Release-Engineering-Team (Code Health), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (akosiaris) >>! In T211881#6006347, @Milimetric wrote: > Recent events have compelled me to adopt graphoid. I will...
[08:17:55] (PS1) Gehel: maps: add parameter to disable tilerator [puppet] - https://gerrit.wikimedia.org/r/585128
[08:18:56] (PS1) KartikMistry: apertium-eo-es: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-eo-es] - https://gerrit.wikimedia.org/r/585129 (https://phabricator.wikimedia.org/T247585)
[08:20:44] (CR) Dzahn: [C: +2] ATS: use discovery name instead of miscweb again after migration [puppet] - https://gerrit.wikimedia.org/r/584607 (https://phabricator.wikimedia.org/T247648) (owner: Dzahn)
[08:21:26] (CR) jerkins-bot: [V: -1] maps: add parameter to disable tilerator [puppet] - https://gerrit.wikimedia.org/r/585128 (owner: Gehel)
[08:21:33] !log pool cp2039 - T248816
[08:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:39] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816
[08:22:25] (PS2) Gehel: maps: add parameter to disable tilerator [puppet] - https://gerrit.wikimedia.org/r/585128
[08:25:22] (CR) Dzahn: [C: +2] switch backend for planet to planet1002 [dns] - https://gerrit.wikimedia.org/r/584887 (https://phabricator.wikimedia.org/T247651) (owner: Dzahn)
[08:25:27] (PS2) Dzahn: switch backend for planet to planet1002 [dns] - https://gerrit.wikimedia.org/r/584887 (https://phabricator.wikimedia.org/T247651)
[08:25:43] (CR) jerkins-bot: [V: -1] maps: add parameter to disable tilerator [puppet] - https://gerrit.wikimedia.org/r/585128 (owner: Gehel)
[08:27:02] (PS1) Vgutierrez: site,install_server: Decommission cp2017 [puppet] - https://gerrit.wikimedia.org/r/585131 (https://phabricator.wikimedia.org/T249084)
[08:28:21] !log depool & decommission cp2017 - T249084
[08:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:27] T249084: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084
[08:28:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[08:28:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:49] (CR) Vgutierrez: [C: +2] site,install_server: Decommission cp2017 [puppet] - https://gerrit.wikimedia.org/r/585131 (https://phabricator.wikimedia.org/T249084) (owner: Vgutierrez)
[08:29:40] (PS3) Gehel: maps: add parameter to disable tilerator [puppet] - https://gerrit.wikimedia.org/r/585128 (https://phabricator.wikimedia.org/T249086)
[08:29:54] (CR) Gehel: "PCC confirms this is a noop: https://puppet-compiler.wmflabs.org/compiler1003/21646/" [puppet] - https://gerrit.wikimedia.org/r/585128 (https://phabricator.wikimedia.org/T249086) (owner: Gehel)
[08:30:00] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission
[08:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:34] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[08:30:37] o/ If anyone would like to finish off the +2 / merging of https://phabricator.wikimedia.org/T248482 from ops that would be grand!
[08:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:40] Operations, ops-codfw, Traffic, decommission, Patch-For-Review: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2017.codfw.wmnet` - cp2017.codfw.wmnet (**PASS**)...
[08:30:50] afaik all of the boxes are tickets, it is just a case of getting the patches in
[08:31:01] * addshore thinks there might be a better channel for this chat, sre perhaps!
[08:31:13] Operations, ops-codfw, Traffic, decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (Vgutierrez) a:Vgutierrez→Papaul
[08:31:42] Operations, Traffic, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[08:32:48] (PS1) Gehel: maps: disable tilerator in codfw for data reload [puppet] - https://gerrit.wikimedia.org/r/585141 (https://phabricator.wikimedia.org/T249086)
[08:35:12] (PS1) Vgutierrez: Remove cp2017 entries [dns] - https://gerrit.wikimedia.org/r/585142 (https://phabricator.wikimedia.org/T249084)
[08:35:54] (CR) Vgutierrez: [C: +2] Remove cp2017 entries [dns] - https://gerrit.wikimedia.org/r/585142 (https://phabricator.wikimedia.org/T249084) (owner: Vgutierrez)
[08:37:08] !log restart bacula at backup1001
[08:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:21] (CR) Addshore: [C: +1] "Indeed." [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[08:43:18] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:29] !log Stop haproxy on dbproxy1010 T248944
[08:43:32] (PS1) Vgutierrez: site: Reimage cp2040 as cache::upload [puppet] - https://gerrit.wikimedia.org/r/585143 (https://phabricator.wikimedia.org/T248816)
[08:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:36] T248944: Decommission dbproxy1010.eqiad.wmnet - https://phabricator.wikimedia.org/T248944
[08:44:55] (CR) Vgutierrez: [C: +2] site: Reimage cp2040 as cache::upload [puppet] - https://gerrit.wikimedia.org/r/585143 (https://phabricator.wikimedia.org/T248816) (owner: Vgutierrez)
[08:46:38] Operations, Traffic, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2040.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage...
[08:46:41] !log deneb, boron: systemctl reset-failed to clear up systemd state alerts
[08:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:45] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:50] dcausse: how do we go about https://phabricator.wikimedia.org/T249041 being deployed?
[08:47:16] RhinosF1: we'll deploy it next monday during the wdqs deploy window
[08:47:31] unless it's urgent I can deploy it now
[08:47:50] (PS1) Gehel: maps: isolate osm master from the codfw maps cluster [puppet] - https://gerrit.wikimedia.org/r/585153 (https://phabricator.wikimedia.org/T249086)
[08:47:51] dcausse: great! I've never done a wdqs patch so didn't know. Not a clue, I just did a patch based on the task.
[08:48:17] RhinosF1: I'll ping you on the task once it's deployed
[08:48:24] great!
[08:53:26] PROBLEM - Check systemd state on db2093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:20] marostegui: ^^?
[08:54:28] I am checking
[08:54:29] checking
[08:54:33] have disabled notifications
[08:55:43] ah, a crash
[08:55:45] not surprising
[08:55:51] ?
[08:57:36] (CR) Alexandros Kosiaris: firewall: ensure abuse network blocks are placed first (1 comment) [puppet] - https://gerrit.wikimedia.org/r/584912 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[08:59:20] Operations, ops-codfw, Traffic, decommission: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (Vgutierrez)
[09:03:30] (PS1) Vgutierrez: site,install_server: Decommission cp2013 [puppet] - https://gerrit.wikimedia.org/r/585159 (https://phabricator.wikimedia.org/T249088)
[09:05:58] !log planet - the backend server has been switched from planet1001 (stretch) to planet1002 (buster) - T247651
[09:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:04] T247651: upgrade planet.wikimedia.org backends to buster - https://phabricator.wikimedia.org/T247651
[09:06:24] (PS4) Jcrespo: admin: convert user itamar from ldap to shell [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[09:07:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[09:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:57] (PS2) Marostegui: tendril.pp: Set default basedir depending on the OS [puppet] - https://gerrit.wikimedia.org/r/585115 (https://phabricator.wikimedia.org/T248957)
[09:09:45] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:57] (CR) Ayounsi: completed rollout of sensible flow-table-sizes (4 comments) [homer/public] - https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: CDanis)
[09:11:24] (CR) Jcrespo: [C: +2] admin: convert user itamar from ldap to shell [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[09:12:04] Operations, Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2040.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2040.codfw.wmnet'] `
[09:12:24] (CR) Muehlenhoff: tendril.pp: Set default basedir depending on the OS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/585115 (https://phabricator.wikimedia.org/T248957) (owner: Marostegui)
[09:12:33] (PS5) Jcrespo: admin: grant user itamar analytics access [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[09:13:28] (CR) Ayounsi: [C: +1] "2 critical nits." (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/584971 (owner: Volans)
[09:14:40] (CR) Marostegui: tendril.pp: Set default basedir depending on the OS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/585115 (https://phabricator.wikimedia.org/T248957) (owner: Marostegui)
[09:15:29] (CR) Jcrespo: [C: +1] tendril.pp: Set default basedir depending on the OS [puppet] - https://gerrit.wikimedia.org/r/585115 (https://phabricator.wikimedia.org/T248957) (owner: Marostegui)
[09:15:56] (CR) Urbanecm: [C: -2] "No EDP is linked on the task." [mediawiki-config] - https://gerrit.wikimedia.org/r/584913 (owner: 4nn1l2)
[09:17:07] (CR) Jcrespo: [C: +2] admin: grant user itamar analytics access [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[09:21:00] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (jcrespo)
[09:21:07] (CR) Ema: [C: +2] conftool::scripts: add ispooled [puppet] - https://gerrit.wikimedia.org/r/584613 (https://phabricator.wikimedia.org/T248067) (owner: Ema)
[09:22:35] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (jcrespo) @ItamarWMDE, after a few minutes passes (~30) you should be able to log in following https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_...
[09:22:51] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (jcrespo) @addshore please report here any per-service access provided, too.
[09:23:18] (03PS2) 10Dzahn: replace DCHP relays with new installservers [homer/public] - 10https://gerrit.wikimedia.org/r/584963 (https://phabricator.wikimedia.org/T224576) [09:23:57] (03CR) 10Marostegui: [C: 03+2] tendril.pp: Set default basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/585115 (https://phabricator.wikimedia.org/T248957) (owner: 10Marostegui) [09:26:23] !log Downgrade mariadb package from 10.4.12-2 to 10.4.12-1 [09:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:30] !log last entry was for db2093 [09:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:37] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10Addshore) Thanks all! The next step for data access will be https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos setup Ping #analyti... [09:26:48] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10Addshore) Thanks all! The next step for data access will be https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos setup Ping #analytics... 
[09:27:26] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10Addshore) [09:27:35] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10Addshore) [09:27:51] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10Addshore) a:05Volans→03None [09:29:13] (03CR) 10Jbond: completed rollout of sensible flow-table-sizes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [09:31:05] (03CR) 10Ayounsi: [C: 03+2] "Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/584963 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [09:31:25] (03Merged) 10jenkins-bot: replace DCHP relays with new installservers [homer/public] - 10https://gerrit.wikimedia.org/r/584963 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [09:33:02] (03PS5) 10Dzahn: install_server: switch TFTP servers in DHCP to new install servers [puppet] - 10https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) [09:33:10] (03CR) 10Dzahn: [C: 03+2] install_server: switch TFTP servers in DHCP to new install servers [puppet] - 10https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [09:34:33] !log install_servers: DHCP_relay in routers and TFTP server in DHCP server config have been switched from install1002/2002 to install1003/2003 - doing a test install, but if any issues report on T224576 [09:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:39] T224576: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 [09:35:02] !log Update install servers IPs (dhcp helpers 
+ firewall rules) - T224576 [09:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:26] !log Deploy schema change on s8 codfw, this will generate lag on codfw [09:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:53] (03PS2) 10Jbond: firewall: ensure abuse network blocks are placed first [puppet] - 10https://gerrit.wikimedia.org/r/584912 (https://phabricator.wikimedia.org/T233945) [09:42:37] (03PS1) 10Dzahn: add planet2002.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/585183 (https://phabricator.wikimedia.org/T247651) [09:43:13] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/584912 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [09:43:42] (03CR) 10Dzahn: [C: 03+2] add planet2002.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/585183 (https://phabricator.wikimedia.org/T247651) (owner: 10Dzahn) [09:44:08] (03PS2) 10Dzahn: add planet2002.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/585183 (https://phabricator.wikimedia.org/T247651) [09:46:03] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [09:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:57] (03CR) 10MSantos: [C: 03+1] maps: add parameter to disable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/585128 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [09:53:09] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:53:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] firewall: ensure abuse network blocks are placed first [puppet] - 10https://gerrit.wikimedia.org/r/584912 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [09:55:23] (03PS6) 10Cparle: Enable 
WikibaseQualityConstraints on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) [09:55:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [09:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:01] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:58:27] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 36 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:58:48] * cormacparle__ waves [09:58:55] ready for the swat window [10:00:46] (03PS1) 10Dzahn: DHCP: remove mw1254-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/585185 (https://phabricator.wikimedia.org/T247780) [10:00:48] (03CR) 10Jbond: [C: 03+2] firewall: ensure abuse network blocks are placed first [puppet] - 10https://gerrit.wikimedia.org/r/584912 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [10:00:50] (03PS1) 10Dzahn: DHCP: add planet2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/585186 (https://phabricator.wikimedia.org/T247651) [10:01:17] (03CR) 10Urbanecm: [C: 03+1] "Code is good." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [10:01:36] (03PS2) 10Dzahn: DHCP: add planet2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/585186 (https://phabricator.wikimedia.org/T247651) [10:05:46] (03CR) 10Dzahn: [C: 03+2] DHCP: add planet2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/585186 (https://phabricator.wikimedia.org/T247651) (owner: 10Dzahn) [10:08:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21637/deploy1001.eqiad.wmnet/ does the right thing, merging." [puppet] - 10https://gerrit.wikimedia.org/r/465411 (owner: 10Giuseppe Lavagetto) [10:09:43] (03PS3) 10Giuseppe Lavagetto: conftool-data: add "canary" faux service to appservers [puppet] - 10https://gerrit.wikimedia.org/r/584861 [10:10:30] (03CR) 10Urbanecm: [C: 03+1] robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [10:11:50] 10Operations, 10LDAP-Access-Requests: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10darthmon_wmde) >>! In T248949#6013600, @Aklapper wrote: > (Thanks for filing this! In the future feel free to also file a task to disable the Phab account - done t... 
[10:12:58] (03Abandoned) 10Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - 10https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [10:13:09] (03PS4) 10Giuseppe Lavagetto: conftool-data: add "canary" faux service to appservers [puppet] - 10https://gerrit.wikimedia.org/r/584861 [10:13:11] (03PS8) 10Giuseppe Lavagetto: scap: define MW canaries dynamically [puppet] - 10https://gerrit.wikimedia.org/r/465411 [10:14:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool-data: add "canary" faux service to appservers [puppet] - 10https://gerrit.wikimedia.org/r/584861 (owner: 10Giuseppe Lavagetto) [10:14:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] systemd: add support for network accounting [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [10:16:31] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes:weight=1; selector: service=canary [10:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:24] <_joe_> jouncebot: next [10:18:24] In 0 hour(s) and 41 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200401T1100) [10:18:48] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "This makes me angry because the same setting was once imposed on my main project, the German Wikipedia, without any actual "consensus". I " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [10:22:26] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests: Grant analytics access to Anti Harassment Tools engineers - https://phabricator.wikimedia.org/T249059 (10jcrespo) Hi, we are having lately a lot of access requests (or grants changes, like this one). Might I ask you to use the recommended template: htt... 
[10:22:33] (03PS1) 10Giuseppe Lavagetto: profile::scap::dsh: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/585191 [10:22:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::scap::dsh: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/585191 (owner: 10Giuseppe Lavagetto) [10:25:50] !log pool cp2040 - T248816 [10:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:56] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [10:26:23] <_joe_> jenkins, look, we've never been friends [10:26:34] XDDDDD [10:26:41] <_joe_> but you could at least show me the decency of answering my requests [10:27:12] <_joe_> see? it worked [10:27:24] <_joe_> a polite complaint goes a long way with software [10:35:32] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10jbond) 05Resolved→03Open I noticed that the disk on gerrit1002 was full again today, ` $ df -h Filesystem Size Used Avail Use% Mounted on udev 7.9G 0 7.9G 0% /dev tmpfs... [10:35:35] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10jbond) [10:41:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall LGTM but I don't think we need the output redirect anymore." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/582933 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [10:46:22] (03PS1) 10Dzahn: start DHCP service on new installserver::light servers [puppet] - 10https://gerrit.wikimedia.org/r/585194 (https://phabricator.wikimedia.org/T224576) [10:47:27] !log Deploy schema change on dbstore1005:3318 [10:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:32] (03CR) 10Dzahn: [C: 03+2] start DHCP service on new installserver::light servers [puppet] - 10https://gerrit.wikimedia.org/r/585194 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [10:47:40] (03PS2) 10Dzahn: start DHCP service on new installserver::light servers [puppet] - 10https://gerrit.wikimedia.org/r/585194 (https://phabricator.wikimedia.org/T224576) [10:49:15] 10Operations, 10LDAP-Access-Requests, 10serviceops: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10jcrespo) Hi, AMooney, 1 requests- * I cannot see @Peter.ovchyn as an employee in our database or with a @wikimedia.org mail, that is probabl... 
[10:50:05] 10Operations, 10LDAP-Access-Requests, 10serviceops: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10jcrespo) [10:50:57] PROBLEM - Confd template for /etc/dsh/group/mediawiki-jobrunner-canaries on bast4002 is CRITICAL: File not found: /etc/dsh/group/mediawiki-jobrunner-canaries https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:53:03] PROBLEM - Confd template for /etc/dsh/group/mediawiki-parsoid-canaries on bast4002 is CRITICAL: File not found: /etc/dsh/group/mediawiki-parsoid-canaries https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:53:43] (03PS8) 10Gergő Tisza: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) [10:53:53] (03PS7) 10Gergő Tisza: Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) [10:55:23] RECOVERY - Confd template for /etc/dsh/group/mediawiki-jobrunner-canaries on bast4002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:55:23] RECOVERY - Confd template for /etc/dsh/group/mediawiki-parsoid-canaries on bast4002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200401T1100). [11:00:04] cormacparle, Ammarpad, and samwilson: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:18] I'm here [11:01:16] !log planet1001 - reinstall OS to test install_server switch, ATS switched to planet1002 earlier [11:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:39] I'm here too [11:02:27] ok for me to go ahead? Urbanecm ? [11:02:41] cormacparle__: sure, feel free to self-service! [11:02:51] cool, starting now [11:03:05] !log install bluez update on ganeti-canary and cloudvirt/cloudcontrol-dev [11:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:10] 10Operations, 10LDAP-Access-Requests, 10serviceops: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10jcrespo) [11:04:01] jbond42: bluez heh, shouldn't we remove that globally? [11:04:29] PROBLEM - Check that envoy is running on planet1001 is CRITICAL: connect to address 10.64.0.50 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:04:33] jouncebot is snubbing me [11:05:16] mutante: its a dependency on qemu [11:05:17] anyone willing to add https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/584579 to their SWAT routine? 
it needs no testing, probably doesn't even need to be synced [11:05:18] ACKNOWLEDGEMENT - Check size of conntrack table on planet1001 is CRITICAL: connect to address 10.64.0.50 port 5666: Connection refused daniel_zahn test install https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:05:18] ACKNOWLEDGEMENT - Check systemd state on planet1001 is CRITICAL: connect to address 10.64.0.50 port 5666: Connection refused daniel_zahn test install https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:18] ACKNOWLEDGEMENT - Check that envoy is running on planet1001 is CRITICAL: connect to address 10.64.0.50 port 5666: Connection refused daniel_zahn test install https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:05:32] jbond42: oh really? ok [11:05:42] hmm ... I'm getting "no version entry for `--wiki=commonswiki` when I'm trying to create a table using `mwscript sql.php` [11:05:47] yes i guess so it can expose bluetooth to VM's [11:05:48] anyone know why? [11:05:50] bluetooth for qemu.. guess so :p [11:06:03] heh, ok [11:06:13] and im not sure its worth managing our own qemu just to drop that [11:06:24] but i was supprised to see it myself [11:06:44] (03PS1) 10Elukey: admin: add kerberos flag to user tarrow [puppet] - 10https://gerrit.wikimedia.org/r/585197 [11:07:29] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10elukey) ` elukey@krb1001:~$ sudo manage_principals.py create tarrow --email_address=thomas.arrow_ext@wikimedia.de Principal successfully created. Ma... [11:07:51] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:08:16] cormacparle__: could you post your full command? 
[11:08:26] cormacparle__: works for me [11:08:27] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:08:40] cormacparle__: works for me too [11:08:41] mwscript sql.php —-wiki commonswiki extensions/WikibaseQualityConstraints/sql/create_wbqc_constraints.sql [11:09:08] weird dashes? [11:09:12] does that literally say —-wiki commonswiki? [11:09:17] the correct is --wiki=commonswiki [11:09:21] then it should work [11:09:22] probably sees --wiki as the wiki name [11:09:32] yeah, that too [11:09:39] (03CR) 10jerkins-bot: [V: 04-1] admin: add kerberos flag to user tarrow [puppet] - 10https://gerrit.wikimedia.org/r/585197 (owner: 10Elukey) [11:09:50] (03PS3) 10Jcrespo: admin: Add aaron, dpifke, phedenskog to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/584915 (https://phabricator.wikimedia.org/T248797) [11:09:52] (03PS2) 10Jcrespo: admin: Add holger to restricted group to run maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/584932 (https://phabricator.wikimedia.org/T248922) [11:09:54] (03PS1) 10Jcrespo: admin: Add bitpogo to the list of absented ldap users [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) [11:10:00] also you'll need --write for that to work [11:10:26] (03PS2) 10Elukey: admin: add kerberos flag to user tarrow [puppet] - 10https://gerrit.wikimedia.org/r/585197 (https://phabricator.wikimedia.org/T248498) [11:10:41] (03PS1) 10Dzahn: stop DHCP service on jessie install servers [puppet] - 10https://gerrit.wikimedia.org/r/585199 (https://phabricator.wikimedia.org/T224576) [11:10:56] (03CR) 10Jcrespo: "I moved this former privileged (but no shell) ldap user to set it as absent, is both set it to absent AND add it to this group required? 
(" [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) (owner: 10Jcrespo) [11:10:59] --write in the command? [11:11:14] sql.php --wiki=... --write [11:11:15] (03CR) 10Dzahn: [C: 03+2] stop DHCP service on jessie install servers [puppet] - 10https://gerrit.wikimedia.org/r/585199 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [11:11:25] I think it connects to a slave otherwise [11:11:35] +1 to what tgr said [11:11:39] ok [11:12:03] (03CR) 10Jcrespo: "Only this was done an a previous patch." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) (owner: 10Jcrespo) [11:12:29] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [11:14:35] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag to user tarrow [puppet] - 10https://gerrit.wikimedia.org/r/585197 (https://phabricator.wikimedia.org/T248498) (owner: 10Elukey) [11:14:44] just checking it worked ... [11:15:34] nice cormacparle__ [11:16:06] 10Operations, 10LDAP-Access-Requests, 10serviceops: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10jcrespo) p:05Triage→03Medium [11:17:19] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10elukey) 05Open→03Resolved [11:19:06] (03CR) 10Cparle: [C: 03+2] Enable WikibaseQualityConstraints on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:19:36] seems to have worked, proceeding with the config change [11:19:50] cormacparle__: nice. Could you please !log the table creation? 
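A note on the `mwscript sql.php` failure discussed above: the pasted command began its option with an em dash ("—-wiki") rather than two ASCII hyphens, which is what the "weird dashes?" remark points at. A minimal sketch of how such characters could be caught before running a command follows; the helper name `find_suspicious_dashes` is hypothetical and not part of any Wikimedia tooling.

```python
import unicodedata


def find_suspicious_dashes(args):
    """Return (arg_index, char_index, unicode_name) for each non-ASCII dash.

    Hypothetical helper: Unicode category "Pd" (dash punctuation) includes the
    ASCII hyphen-minus, so that one character is explicitly let through.
    Characters outside "Pd" (e.g. U+2212 MINUS SIGN) are not detected here.
    """
    hits = []
    for i, arg in enumerate(args):
        for j, ch in enumerate(arg):
            if ch != "-" and unicodedata.category(ch) == "Pd":
                hits.append((i, j, unicodedata.name(ch)))
    return hits


# Arguments as pasted in the channel (em dash followed by a hyphen).
pasted = ["sql.php", "\u2014-wiki", "commonswiki"]
# Form suggested in the channel: ASCII hyphens, "=" syntax, plus --write.
suggested = ["sql.php", "--wiki=commonswiki", "--write"]

pasted_hits = find_suspicious_dashes(pasted)
suggested_hits = find_suspicious_dashes(suggested)
print(pasted_hits)
print(suggested_hits)
```

Autocorrecting editors and chat clients are a common source of these characters, so a check like this is mostly useful when a command has been copied out of a document or message.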
[11:20:02] (03Merged) 10jenkins-bot: Enable WikibaseQualityConstraints on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:20:43] !log created table wbqc_constraints on commonswiki [11:20:46] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10darthmon_wmde) >>! In T248949#6014134, @jcrespo wrote: > On the SRE-production side this is done, only waiting legal for the above consultati... [11:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:02] (03PS1) 10Jcrespo: admin: Add Peter.ovchyn to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/585200 (https://phabricator.wikimedia.org/T249037) [11:22:11] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:21] (03PS3) 10Ammarpad: Restrict short URL management log to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580290 (https://phabricator.wikimedia.org/T221073) [11:23:21] ACKNOWLEDGEMENT - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn test [11:23:23] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [11:23:47] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580290 (https://phabricator.wikimedia.org/T221073) (owner: 10Ammarpad) [11:24:23] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10aborrero) >>! In T241719#6017131, @Bstorm wrote: > This is an m1.small. Maybe the instance size is just too low for recent versions? I can tell the... [11:25:15] testing on mwdebug1001 [11:25:23] (03CR) 10Jcrespo: [C: 04-1] "Waiting for the actual LDAP grant change, which is waiting on NDA." 
[puppet] - 10https://gerrit.wikimedia.org/r/585200 (https://phabricator.wikimedia.org/T249037) (owner: 10Jcrespo) [11:28:02] 10Operations, 10LDAP-Access-Requests, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10Dzahn) For contractors we would expect to see an address of the format povchyn-ctr@wikimedia.org. [11:28:08] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10jcrespo) > The account has been removed from wikidata-dev. #cloud-services-team could you help us removing... [11:28:51] hmmm that seems to have broken something, doing a revert [11:30:42] (03CR) 10Jbond: [C: 03+1] systemd: add support for network accounting [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [11:32:19] cormacparle__: is the revert going fine? 🙂 [11:32:57] (03PS4) 10Gehel: maps: add parameter to disable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/585128 (https://phabricator.wikimedia.org/T249086) [11:33:21] yep, just checking with Matthias here that I'm not imagining the broken-ness [11:33:29] k [11:37:09] (03CR) 10Gehel: [C: 03+2] maps: add parameter to disable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/585128 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [11:38:48] cormacparle__: note, we have 20 minutes and 3 patches left. I think we should halt deployment of this patch and leave investigation for later. [11:39:11] (03PS2) 10Gehel: maps: disable tilerator in codfw for data reload [puppet] - 10https://gerrit.wikimedia.org/r/585141 (https://phabricator.wikimedia.org/T249086) [11:39:33] ok - looks like things are broken on all the debug servers even though I've only pulled onto one [11:39:45] so I *could* proceed anyway? 
[11:39:50] or just revert? [11:40:06] proceeding is quickest (if it works) [11:40:09] (sorry for holding everyone up!) [11:40:13] (03PS1) 10Elukey: admin: add kerberos flag for user itamar [puppet] - 10https://gerrit.wikimedia.org/r/585201 (https://phabricator.wikimedia.org/T248482) [11:40:52] I don't know anything about this particular codebase. It's up to you to decide if it's okay to deploy or not. If you're not sure, not deploying is always better. [11:40:55] what's broken exactly? [11:41:03] 10Operations, 10LDAP-Access-Requests, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10jcrespo) Both OIT and @kfrancis may help with T249037#6018141 respectively regarding wikimedia email (corporate LDAP) a... [11:41:15] CORS problems when calling wbsearchentities vis js [11:41:45] I would also recommend not deploying if in-doubt, now that we are in low-risk mode [11:42:16] that said, the next deploy window is in 6 hours so I wouldn't worry about overrunning this one [11:42:29] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10elukey) ` elukey@krb1001:~$ sudo manage_principals.py create itamar --email_address=itamar.givon@wikimedia.de Principal successfully created. Ma... [11:42:51] I think I'll go ahead with the deployment, revert is ready anyway in case there's a problem [11:42:55] (03CR) 10Gehel: [C: 03+2] maps: disable tilerator in codfw for data reload [puppet] - 10https://gerrit.wikimedia.org/r/585141 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [11:42:56] and it works on the non-debug hosts? or is that new functionality? [11:43:15] yeah, works on non-debug hosts [11:43:46] do you have a reproducible example? [11:44:13] does the CORS request itself rely on the deploy? 
[11:44:14] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable WikibaseQualityConstraints on Commons (duration: 01m 18s) [11:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:26] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for user itamar [puppet] - 10https://gerrit.wikimedia.org/r/585201 (https://phabricator.wikimedia.org/T248482) (owner: 10Elukey) [11:44:35] is it possible that the chrome plugin drops some request headers (Origin) [11:45:15] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:45:18] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10elukey) 05Open→03Resolved [11:45:38] ok deployed and it's working [11:45:41] panic over [11:45:47] don't know what the story is with CORS [11:45:54] ok for everyone else to fo ahead [11:45:58] sorry for the delay [11:45:59] probably we are just not whitelisting X-Wikimedia-Debug [11:46:12] worth a task ig [11:46:16] you should file a task with reproduction steps [11:46:22] * Urbanecm goes ahead with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/580290 [11:46:23] sure, will do [11:46:28] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580290 (https://phabricator.wikimedia.org/T221073) (owner: 10Ammarpad) [11:46:39] thanks for all yr help folks [11:47:17] and before that merges, syncing IS.php for the second time [11:47:25] (03Merged) 10jenkins-bot: Restrict short URL management log to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580290 (https://phabricator.wikimedia.org/T221073) (owner: 10Ammarpad) [11:47:52] (due to T236104) [11:47:53] T236104: Cache of wmf-config/InitialiseSettings often 1 step behind - 
https://phabricator.wikimedia.org/T236104 [11:48:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, both the absent_ldap list and the user entry need to be updated." [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) (owner: 10Jcrespo) [11:48:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable WikibaseQualityConstraints on Commons take II (duration: 01m 06s) [11:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:33] cormacparle__: please re-check your patch [11:48:37] (03CR) 10Jcrespo: [C: 03+2] admin: Add bitpogo to the list of absented ldap users [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) (owner: 10Jcrespo) [11:49:03] (03PS2) 10Jcrespo: admin: Add bitpogo to the list of absented ldap users [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) [11:49:33] (03PS3) 10Jcrespo: admin: Add bitpogo to the list of absented ldap users [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) [11:49:45] Urbanecm: oh! didn't it sync? [11:49:58] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:02] cormacparle__: it did, but it's current recommendation to sync it twice due to a bug [11:50:05] (it affects only IS.php) [11:50:12] maps is me, checking [11:50:20] 10Operations, 10Analytics, 10netops: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (10elukey) We could start with TLS authentication only, with: ` security.protocol=SSL ssl.ca.location=/etc/ssl/certs/Puppet_Internal_CA.pem ssl.cipher.suites=ECDHE-EC... 
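On the earlier CORS guess ("probably we are just not whitelisting X-Wikimedia-Debug"): this was a hypothesis in the channel, not a confirmed cause. A simplified sketch of the browser-side preflight rule it refers to is below. It assumes the server answers the preflight with an `Access-Control-Allow-Headers` list and ignores wildcards, credentials, and the value restrictions on safelisted headers; the function name and the example allow-list are hypothetical.

```python
# Request headers a browser may send cross-origin without asking permission
# (simplified: the real rules also restrict the values of these headers).
SAFELISTED = {"accept", "accept-language", "content-language", "content-type"}


def blocked_headers(request_headers, allow_headers):
    """Return the request headers a preflight response would not permit.

    request_headers: header names the script wants to send.
    allow_headers:   value of the Access-Control-Allow-Headers response header.
    Header names are compared case-insensitively.
    """
    allowed = {h.strip().lower() for h in allow_headers.split(",") if h.strip()}
    return [
        h for h in request_headers
        if h.lower() not in SAFELISTED and h.lower() not in allowed
    ]


# Hypothetical allow-list that does not mention the debug header.
without_debug = blocked_headers(["Content-Type", "X-Wikimedia-Debug"], "X-Example")
# Same request once the debug header has been added to the allow-list.
with_debug = blocked_headers(["X-Wikimedia-Debug"], "X-Example, x-wikimedia-debug")
print(without_debug)
print(with_debug)
```

If the guess were right, only requests carrying the extra debug header would fail, which would match the report that the same call worked on non-debug hosts.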
[11:50:24] PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:18] well, if it worked after one sync, re-checking won't tell much [11:51:57] unless we were lucky and it worked only on that particular server the request was routed to [11:51:58] (03PS5) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) [11:52:23] (03CR) 10Jcrespo: [C: 03+2] admin: Add bitpogo to the list of absented ldap users [puppet] - 10https://gerrit.wikimedia.org/r/585198 (https://phabricator.wikimedia.org/T248949) (owner: 10Jcrespo) [11:52:30] aha ok ... we have a error, digging into it atm [11:52:56] it won't tell much in that case either [11:53:10] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 4968501: Restrict short URL management log to stewards (T221073) (duration: 01m 07s) [11:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:16] T221073: URL shortener link management should be logged - https://phabricator.wikimedia.org/T221073 [11:54:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 4968501: Restrict short URL management log to stewards (T221073; take II) (duration: 01m 05s) [11:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:43] * Urbanecm done [11:55:59] (03PS1) 10Cparle: Revert "Enable WikibaseQualityConstraints on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585203 [11:56:27] (03PS2) 10Cparle: Revert "Enable WikibaseQualityConstraints on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585203 [11:56:46] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for 
ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10ItamarWMDE) @jcrespo I confirm that I was able to log in. Thank you. @elukey Thanks for the prompt response! [11:57:24] 10Operations, 10Mail, 10MediaWiki-Email: Domain of sender address of Wikimedia mail notifications is set to mw1337.eqiad.wmn for emails from Sinhala Wikipedia - https://phabricator.wikimedia.org/T249014 (10Aklapper) It does, thanks for forwarding that email! I imported your email into my email application:... [11:57:33] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10jcrespo) Oh, sorry, cloud-services-team, I thought project membership for VPS was handled on a separate aut... [11:58:30] I can deploy the rest [11:58:48] (03PS2) 10Gergő Tisza: Enable password-reset-update on all other than Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585059 (https://phabricator.wikimedia.org/T245791) (owner: 10Samwilson) [11:59:23] tgr: we're getting a db error after deploying that change, I need to deploy a revert [11:59:32] ok, I'll wait [11:59:33] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/585203 [11:59:35] 10Operations, 10Mail: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Jhernandez) [11:59:43] (03PS3) 10Cparle: Revert "Enable WikibaseQualityConstraints on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585203 [12:00:20] 10Operations, 10Mail: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Jhernandez) We should verify somehow that: * It isn't just me * If other kinds of email delivery are working, like the welcome emails [12:00:43] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - 
https://phabricator.wikimedia.org/T241719 (10MoritzMuehlenhoff) >>! In T241719#6017131, @Bstorm wrote: > This is an m1.small. Maybe the instance size is just too low for recent versions? Probab... [12:01:19] (03CR) 10Cparle: [C: 03+2] Revert "Enable WikibaseQualityConstraints on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585203 (owner: 10Cparle) [12:02:01] 10Operations, 10Mail, 10MediaWiki-Email: Domain of sender address of Wikimedia mail notifications is set to mw1337.eqiad.wmn for emails from Sinhala Wikipedia - https://phabricator.wikimedia.org/T249014 (10Ammarpad) Duplicate of T232199? [12:02:05] 10Operations, 10Mail: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Jhernandez) [12:02:15] (03Merged) 10jenkins-bot: Revert "Enable WikibaseQualityConstraints on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585203 (owner: 10Cparle) [12:03:43] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10jcrespo) 05Open→03Resolved @darthmon_wmde this should be now done, no special privileges left: https://... [12:04:06] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [SDC] Revert enabling WikibaseQualityConstraints on Commons (duration: 01m 05s) [12:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:01] 10Operations, 10Mail: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Tgr) Monthly reminders are working (at least, I got one today). 
Granted, from a security perspective it would probably be better if they didn't :) [12:05:27] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [SDC] Revert enabling WikibaseQualityConstraints on Commons take 2 (duration: 01m 08s) [12:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:42] tgr: done [12:05:47] thx [12:05:55] (03PS3) 10Gergő Tisza: Enable password-reset-update on all other than Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585059 (https://phabricator.wikimedia.org/T245791) (owner: 10Samwilson) [12:06:41] (03CR) 10Gergő Tisza: [C: 03+2] Enable password-reset-update on all other than Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585059 (https://phabricator.wikimedia.org/T245791) (owner: 10Samwilson) [12:07:37] (03Merged) 10jenkins-bot: Enable password-reset-update on all other than Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585059 (https://phabricator.wikimedia.org/T245791) (owner: 10Samwilson) [12:09:07] !log Deploy schema change on db1116:3318 [12:09:15] samwilson1: on mwdebug1002 [12:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:29] tgr: thanks; checking now [12:09:45] (03CR) 10Gergő Tisza: [C: 03+2] Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: 10Gergő Tisza) [12:09:56] tgr: yep, all looks good [12:10:35] (03Merged) 10jenkins-bot: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: 10Gergő Tisza) [12:12:13] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:585059|Enable password-reset-update on all other than Wikipedias (T245791)]] (duration: 01m 07s) [12:12:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:18] T245791: Enable PRU for all other projects [small] - https://phabricator.wikimedia.org/T245791 [12:13:44] cormacparle__: sorry, I just realized – creating that db table shouldn’t have been necessary anyways, right? [12:13:58] WikibaseQualityConstraints should read constraints from wikidatawiki’s table [12:14:07] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: re-sync (duration: 01m 06s) [12:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:59] Lucas_WMDE: cormac is having lunch atm [12:15:34] yes, that table shouldn't be necessary, but we figured it's probably best to have it anyway [12:15:56] !log depool & decommission cp2013 - T249088 [12:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:03] T249088: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 [12:16:04] running only a partial installation of an extension we don't really maintain/know ourselves sounds like a bad plan :p [12:16:12] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2013 [puppet] - 10https://gerrit.wikimedia.org/r/585159 (https://phabricator.wikimedia.org/T249088) (owner: 10Vgutierrez) [12:16:20] (03PS2) 10Vgutierrez: site,install_server: Decommission cp2013 [puppet] - 10https://gerrit.wikimedia.org/r/585159 (https://phabricator.wikimedia.org/T249088) [12:17:01] so the (empty) table should be there in case some day something forgets that there might not be a local DB :P [12:17:22] I’m about to have lunch too, picked a bad time to ask that question ^^ [12:17:29] but ok [12:17:31] !log restart nfacct on netflow4001 for kafka tls tests - T248980 [12:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:37] T248980: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 [12:17:41] !log tgr@deploy1001 Synchronized 
dblists/growthexperiments.dblist: SWAT: [[gerrit:584579|Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments (T248844)]] (duration: 01m 05s) [12:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:47] T248844: Some wikis aren't included in growthexperiments.dblist - https://phabricator.wikimedia.org/T248844 [12:17:57] that said, we ran into db issues: "Cannot access the database: Unknown error" [12:18:07] (03PS1) 10Ema: ATS: check if fifo-log-demux is logging [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) [12:18:10] from ConstraintRepository->queryConstraintsForProperty [12:18:26] any chance this could have something to do with recent federation refactors? [12:18:36] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:18:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:56] well if something tries to read/write the local DB, it's probably better to not find the expected table and fail then to silently read/write a table that's not actually used [12:19:09] !log tgr@deploy1001 Synchronized wmf-config/config: SWAT: [[gerrit:584579|Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments (T248844)]] (duration: 01m 06s) [12:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [12:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:53] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [12:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:00] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 
10Patch-For-Review: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2013.codfw.wmnet` - cp2013.codfw.wmnet (**PASS**)... [12:20:53] (03PS1) 10Vgutierrez: Remove cp2013 entries [dns] - 10https://gerrit.wikimedia.org/r/585210 (https://phabricator.wikimedia.org/T249088) [12:21:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Will merge this on 2020-04-15" [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) (owner: 10Arturo Borrero Gonzalez) [12:21:32] tgr: that also makes sense - though I imagine in such case it would probably use both local & federated data, then (instead of fatal) [12:21:33] huh, we don't have a fatal-monitor dashboard on logstash anymore? [12:21:42] did that get merged into mediawiki-errors? [12:22:15] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2013 entries [dns] - 10https://gerrit.wikimedia.org/r/585210 (https://phabricator.wikimedia.org/T249088) (owner: 10Vgutierrez) [12:22:21] IDK what's best - we just figured we'd create the table so that the extension is being run in the way it was actually designed to be used (even if we're not actually using local data) [12:23:22] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10Vgutierrez) a:05Vgutierrez→03Papaul [12:23:40] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [12:23:54] matthiasmullie: no idea about that error, is there a phabricator task? [12:23:58] then I can take a look later [12:24:32] not yet :) [12:24:39] enjoy your lunch! 
[12:27:36] 10Operations, 10Mail, 10MediaWiki-Email: Domain of sender address of Wikimedia mail notifications is set to mw1337.eqiad.wmn for emails from Sinhala Wikipedia - https://phabricator.wikimedia.org/T249014 (10Rehman) > Indeed something seems wrong on the Wikimedia side, as the sender address shows `mw1337.eqiad... [12:28:57] (03PS2) 10Ema: ATS: check if fifo-log-demux is logging [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) [12:31:20] tgr: we have? https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor [12:31:29] (03CR) 10Vgutierrez: ATS: check if fifo-log-demux is logging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [12:31:31] (03CR) 10Ema: "pcc output looks correct to me: https://puppet-compiler.wmflabs.org/compiler1001/21650/" [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [12:33:00] (03CR) 10Urbanecm: [C: 03+1] "To be honest, I don't see the purpose of CR-1, because I don't see how this is a code review, or even relevant to srwiki's decision or thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [12:33:20] Urbanecm: used to be in the dashboard navbar, and it's not anymore [12:33:43] (03PS3) 10Ema: ATS: check if fifo-log-demux is logging [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) [12:33:54] tgr: true. 
It's however still accessible by going to dashboard->fatal monitor [12:34:25] (not sure when/why that was removed) [12:34:50] (03PS4) 10Ema: ATS: check if fifo-log-demux is logging [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) [12:34:56] 10Operations, 10Mail, 10MediaWiki-Email: Domain of sender address of Wikimedia mail notifications is set to mw1337.eqiad.wmn for emails from Sinhala Wikipedia - https://phabricator.wikimedia.org/T249014 (10Aklapper) (If you copied and pasted something from somewhere, instead of directly *saving* the full ema... [12:37:58] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "> That's how communities decide about all sort of stuff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [12:40:02] (03CR) 10Ema: ATS: check if fifo-log-demux is logging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [12:42:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:42:25] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Reedy) [12:43:57] (03CR) 10Urbanecm: [C: 03+1] "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [12:44:28] (03CR) 10Vgutierrez: ATS: check if fifo-log-demux is logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [12:45:49] (03PS1) 10Vgutierrez: site: Reimage cp2041 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/585212 
(https://phabricator.wikimedia.org/T248816) [12:45:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:46:37] (03CR) 10jerkins-bot: [V: 04-1] site: Reimage cp2041 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/585212 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [12:47:03] (03PS2) 10Vgutierrez: site: Reimage cp2041 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/585212 (https://phabricator.wikimedia.org/T248816) [12:48:36] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2041 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/585212 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [12:49:12] 10Operations, 10Mail, 10MediaWiki-Email: Domain of sender address of Wikimedia mail notifications is set to mw1337.eqiad.wmn for emails from Sinhala Wikipedia - https://phabricator.wikimedia.org/T249014 (10Rehman) >>! In T249014#6018521, @Aklapper wrote: > (If you copied and pasted something from somewhere,... [12:50:04] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2041.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... [12:50:38] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:44] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:04] this is me and Arzhel testing --^ [12:53:35] (03PS1) 10JMeybohm: Add user jayme [puppet] - 10https://gerrit.wikimedia.org/r/585213 [12:53:37] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/585213 (owner: 10JMeybohm) [12:58:40] (03PS5) 10Ema: ATS: check if fifo-log-demux is logging [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) [12:58:51] (03CR) 10Ema: ATS: check if fifo-log-demux is logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [12:59:35] (03CR) 10Vgutierrez: [C: 03+1] ATS: check if fifo-log-demux is logging [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [13:00:50] 10Operations, 10cloud-services-team (Kanban): Remove old OpenStack config and manifests - https://phabricator.wikimedia.org/T249058 (10Reedy) [13:04:31] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10ItamarWMDE) @elukey Should this also provide me with access to hue.wikimedia.org? [13:06:51] 10Operations, 10Maps (Tilerator), 10Product-Infrastructure-Team-Backlog (Kanban): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10MSantos) 05Stalled→03Open Next step is: - Tweak hourly replication rate and monitor disk usage [[ https://gerrit.wikimedia.org/r/c/operati... 
[13:08:00] 10Operations, 10Maps (Maps-data), 10Product-Infrastructure-Team-Backlog (Kanban): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10MSantos) [13:09:09] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [13:10:06] (03PS1) 10Dzahn: decom planet1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/585218 (https://phabricator.wikimedia.org/T247651) [13:10:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:08] (03CR) 10Zoranzoki21: "The opinion of the community is important, as is the existence of a technical capacity." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [13:12:38] (03PS2) 10Dzahn: decom planet1001 and planet2001 [puppet] - 10https://gerrit.wikimedia.org/r/585218 (https://phabricator.wikimedia.org/T247651) [13:13:08] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=rpkicounter site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:15:40] (03PS1) 10Gergő Tisza: [WIP] Enable GrowthExperiments suggested edits on uk, hu, hy, eu wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585219 (https://phabricator.wikimedia.org/T247308) [13:15:51] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2041.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2041.codfw.wmnet'] ` [13:17:20] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depool db1099:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P10843 and previous config saved to /var/cache/conftool/dbconfig/20200401-131719-marostegui.json [13:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:37] !log Deploy schema change on db1099:3318 [13:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:55] 10Operations, 10Maps (Maps-data): Monitor PostgreSQL connection slots - https://phabricator.wikimedia.org/T168767 (10MSantos) Max connections can be tracked from [[ https://grafana.wikimedia.org/d/000000039/maps-osm-database-msantos?orgId=1&refresh=10s | Grafana ]], but monitoring the limit of connections and... [13:21:17] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Reedy) I just requested a password reminder and didn't get it... I haven't seen a monthly reminder in a long time either [13:21:28] PROBLEM - puppet last run on planet1001 is CRITICAL: connect to address 10.64.0.50 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:22:08] (03CR) 10CDanis: completed rollout of sensible flow-table-sizes (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [13:22:13] (03PS5) 10CDanis: completed rollout of sensible flow-table-sizes [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) [13:23:02] PROBLEM - configured eth on planet1001 is CRITICAL: connect to address 10.64.0.50 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:23:07] (03PS1) 10Dzahn: installserver: allow stopping tftp service with parameter [puppet] - 10https://gerrit.wikimedia.org/r/585221 (https://phabricator.wikimedia.org/T224576) [13:24:44] !log 
dzahn@cumin1001 START - Cookbook sre.hosts.downtime [13:24:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:52] 10Operations, 10serviceops, 10Patch-For-Review: upgrade planet.wikimedia.org backends to buster - https://phabricator.wikimedia.org/T247651 (10ops-monitoring-bot) Icinga downtime for 1 day, 0:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: reinstall ` planet1001.eqiad.wmnet ` [13:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:11] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "So you did not asked all users? What if a user realizes in a month that all their pages disappeared from search engines? Do you seriously " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [13:25:25] ACKNOWLEDGEMENT - configured eth on planet1001 is CRITICAL: connect to address 10.64.0.50 port 5666: Connection refused daniel_zahn reinstalled https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:29:55] 10Operations, 10Maps (Maps-data): Monitor PostgreSQL connection slots - https://phabricator.wikimedia.org/T168767 (10MSantos) [13:30:09] 10Operations, 10serviceops: CORS errors on commons on debug servers - https://phabricator.wikimedia.org/T249107 (10Reedy) [13:30:55] 10Operations, 10Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (10JHedden) [13:31:27] (03PS2) 10Dzahn: installserver: stop tftp service on old install servers [puppet] - 10https://gerrit.wikimedia.org/r/585221 (https://phabricator.wikimedia.org/T224576) [13:31:34] (03PS1) 10Elukey: Enable TLS encryption to Kafka Jumbo for netflow4001 [puppet] - 10https://gerrit.wikimedia.org/r/585223 (https://phabricator.wikimedia.org/T248980) [13:32:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is 
OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:32:29] elukey: ^^ related to that CR, we are talking about non buster environments, right? [13:33:28] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21652/" [puppet] - 10https://gerrit.wikimedia.org/r/585223 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [13:33:47] vgutierrez: I was about to ping you [13:33:48] (03CR) 10Ayounsi: [C: 03+1] Enable TLS encryption to Kafka Jumbo for netflow4001 [puppet] - 10https://gerrit.wikimedia.org/r/585223 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [13:33:57] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Jhernandez) I just got all the reminders and some other “Unsubscribe” emails all at once. I’m not sure if someone actually did something... [13:34:05] vgutierrez: I used the same settings as vk basically, didn't check the os [13:34:14] sorry for ruining your ping [13:34:15] I guess you are talking about TLS 1.3 [13:34:19] ahahhaha [13:34:20] indeed [13:34:24] !log sodium (mirror): sudo -u mirror ftpsync to get Debian mirror updated (Icinga says it's old) [13:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:41] vgutierrez: netflow4001 is buster! [13:34:55] * elukey plays sad_trombone.wav [13:34:59] and the other end? [13:35:08] kafka jumbo, so stretch [13:35:14] ah nice no tls on the other side [13:35:20] err no 1.3 [13:35:33] then the CR is perfect as it is [13:35:36] so we'll need to review when jumbo moves to buster right? 
[13:35:43] yeah [13:35:46] <3 [13:35:48] merging then [13:35:49] and maybe patch librdkafka [13:36:01] I know you are looking forward to it [13:36:12] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21653/" [puppet] - 10https://gerrit.wikimedia.org/r/585221 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [13:36:22] (03CR) 10Elukey: [C: 03+2] Enable TLS encryption to Kafka Jumbo for netflow4001 [puppet] - 10https://gerrit.wikimedia.org/r/585223 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [13:36:49] (03CR) 10Ayounsi: [C: 03+2] Remove prepending in esams and eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/584008 (owner: 10Ayounsi) [13:37:13] (03Merged) 10jenkins-bot: Remove prepending in esams and eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/584008 (owner: 10Ayounsi) [13:37:29] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10TheDJ) Someone can remove the puppetmaster of maps, if they can restore the proper puppetmaster on the maps-tiles servers. I tried setting that stuff... [13:38:18] vgutierrez: I think that we can keep TLS as encryption transport together with SASL+GSS-API for authentication with Kafka [13:38:47] nice [13:38:59] but need to test it! 
[13:39:08] if I am right it should be easier to enable it [13:39:43] !log pool cp2041 - T248816 [13:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:50] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [13:39:59] (03PS2) 10Volans: examples: add comments to example config [software/homer] - 10https://gerrit.wikimedia.org/r/584971 [13:40:01] (03PS2) 10Volans: config: complete test coverage [software/homer] - 10https://gerrit.wikimedia.org/r/584972 [13:40:03] (03PS2) 10Volans: plugins: initial implementation for Netbox data [software/homer] - 10https://gerrit.wikimedia.org/r/584973 [13:40:33] (03CR) 10Volans: "addressed comments" (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/584971 (owner: 10Volans) [13:40:42] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10elukey) >>! In T248482#6018579, @ItamarWMDE wrote: > @elukey Should this also provide me with access to hue.wikimedia.org? Needs another access, just added you! [13:41:03] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [13:42:48] ^ i just manually run the sync but as the right user so we should not have permission issues [13:42:54] (03PS1) 10Vgutierrez: site: Reimage cp2042 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/585231 (https://phabricator.wikimedia.org/T248816) [13:44:49] (03PS1) 10Ayounsi: Manage static flowspec rules via Homer [homer/public] - 10https://gerrit.wikimedia.org/r/585232 [13:45:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] php-admin: remove dead code for partial opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/577652 (owner: 10Ori.livneh) [13:45:25] (03CR) 10Zoranzoki21: "> So you did not asked all users? 
What if a user realizes in a month" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [13:45:56] (03PS1) 10Dzahn: installserver/tftp: add missing ensure_service parameter [puppet] - 10https://gerrit.wikimedia.org/r/585233 (https://phabricator.wikimedia.org/T224576) [13:46:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php-admin: remove dead code for partial opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/577652 (owner: 10Ori.livneh) [13:46:37] (03PS3) 10Hashar: jenkins: Adjust CSP header to allow inline CSS and video playback [puppet] - 10https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) [13:46:46] (03CR) 10Dzahn: [C: 03+2] installserver/tftp: add missing ensure_service parameter [puppet] - 10https://gerrit.wikimedia.org/r/585233 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [13:46:57] (03CR) 10Ema: [C: 03+2] ATS: check if fifo-log-demux is logging [puppet] - 10https://gerrit.wikimedia.org/r/585208 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [13:47:22] !log remove AS-path prepending in eqsin [13:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:40] (03CR) 10jerkins-bot: [V: 04-1] jenkins: Adjust CSP header to allow inline CSS and video playback [puppet] - 10https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) (owner: 10Hashar) [13:49:04] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21654/" [puppet] - 10https://gerrit.wikimedia.org/r/585233 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [13:49:07] (03PS1) 10Elukey: profile::pmacct: use kafka TLS connection string when needed [puppet] - 10https://gerrit.wikimedia.org/r/585234 (https://phabricator.wikimedia.org/T248980) [13:49:27] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2042 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/585231 
(https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [13:49:51] (03PS4) 10Hashar: jenkins: Adjust CSP header to allow inline CSS and video playback [puppet] - 10https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) [13:50:21] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/21655/" [puppet] - 10https://gerrit.wikimedia.org/r/585234 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [13:51:47] (03CR) 10Hashar: "I have tested it on contint with a dummy service that output the list of arguments. Somehow the systemd on Jessie split the arguments at" [puppet] - 10https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) (owner: 10Hashar) [13:52:31] vgutierrez: you have no issues with reimage, right? [13:52:47] mutante: what do you mean? [13:53:12] vgutierrez: just want to confirm installservers work normal [13:53:13] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2042.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... 
[13:53:24] i just stopped tftp on the old ones [13:53:25] mutante: oh sure, the installation works like a charm [13:53:32] switched to buster [13:53:34] at least for cp2041 [13:53:42] cp2042 is starting right now [13:53:44] I'll let you know [13:53:47] ok, perfect [13:53:52] i am also doing a VM [13:56:01] atftpd[20015]: Serving lpxelinux.0 to 10.64.0.50:40484 [13:56:31] PROBLEM - TFTP service on install1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [13:56:33] 10Operations, 10Traffic, 10Patch-For-Review: check_trafficserver_log_fifo: false positives when changing log format - https://phabricator.wikimedia.org/T248067 (10ema) 05Open→03Resolved a:03ema [13:56:44] oh, there was monitoring for that :) [13:56:47] ACKing [13:56:59] PROBLEM - TFTP service on install2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [13:57:19] ACKNOWLEDGEMENT - TFTP service on install1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* daniel_zahn service moved to install1003 https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [13:57:51] 10Operations, 10serviceops, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Build and publish a python2 based container to build wheels - https://phabricator.wikimedia.org/T249110 (10hashar) [13:58:00] ACKNOWLEDGEMENT - TFTP service on install2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* daniel_zahn service moved to install2003 https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [13:58:18] 10Operations, 10serviceops, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Build and publish a python2 based 
container to build wheels - https://phabricator.wikimedia.org/T249110 (10hashar) [13:59:43] 10Operations, 10Puppet, 10User-jbond: Add CI check to ensure defaults exist in cloud.yaml - https://phabricator.wikimedia.org/T248994 (10jbond) @Andrew the following keys live in the production hiera but don't exist in cloud.yaml. Do you want entries for all of these or are there some we can safely skip?... [14:00:22] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10ema) 05Open→03Resolved Done: ` $ curl -v -H "X-Wikimedia-Debug: mwdebug1001.eqiad.wmnet" https://en.wikipedia.org/wiki/Main_Page 2>&1... [14:00:28] 10Operations, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10ema) [14:01:23] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "> Wikipedia is not a self-promotional site. Also, Wikipedia is not hosting of advertising material.
[…] For data in user sub(pages), nor i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [14:03:59] (03CR) 10Ayounsi: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/21657/netflow4001.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/585234 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [14:04:09] 10Operations, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10ema) 05Open→03Resolved [14:04:37] (03PS1) 10Dzahn: installserver/tftp: if service is ensured stop, do not monitor it [puppet] - 10https://gerrit.wikimedia.org/r/585238 (https://phabricator.wikimedia.org/T224576) [14:05:17] mutante: cp2042 booted on the debian installer as expected, so tftp seems to be OK :) [14:06:00] (03PS2) 10Dzahn: installserver/tftp: if service is ensured stop, do not monitor it [puppet] - 10https://gerrit.wikimedia.org/r/585238 (https://phabricator.wikimedia.org/T224576) [14:06:01] vgutierrez: great, thanks! 
[14:06:27] i'm removing the (false positive) monitoring [14:06:32] (03CR) 10Zoranzoki21: "> If this is true, it is true for *all* wikis, and this change needs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [14:06:41] (03CR) 10Ayounsi: [C: 03+1] examples: add comments to example config [software/homer] - 10https://gerrit.wikimedia.org/r/584971 (owner: 10Volans) [14:08:11] (03CR) 10Jbond: "Hi Janis," [puppet] - 10https://gerrit.wikimedia.org/r/585213 (owner: 10JMeybohm) [14:09:05] !log remove AS-path prepending in esams [14:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:13] (03CR) 10jerkins-bot: [V: 04-1] installserver/tftp: if service is ensured stop, do not monitor it [puppet] - 10https://gerrit.wikimedia.org/r/585238 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:11:16] (03CR) 10Volans: [C: 03+2] examples: add comments to example config [software/homer] - 10https://gerrit.wikimedia.org/r/584971 (owner: 10Volans) [14:11:24] (03PS3) 10Dzahn: tftp: if service is told to be stopped, do not monitor it [puppet] - 10https://gerrit.wikimedia.org/r/585238 (https://phabricator.wikimedia.org/T224576) [14:13:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:13:59] (03CR) 10Elukey: [C: 03+2] profile::pmacct: use kafka TLS connection string when needed [puppet] - 10https://gerrit.wikimedia.org/r/585234 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [14:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:13] (03Merged) 10jenkins-bot: examples: add comments to example config [software/homer] - 10https://gerrit.wikimedia.org/r/584971 (owner: 10Volans) [14:15:08] (03CR) 10Dzahn: "Welcome Janis! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/585213 (owner: 10JMeybohm) [14:16:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:45] (03PS2) 10Dzahn: admin: Add user jayme [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [14:17:13] (03CR) 10Dzahn: [C: 03+1] "PS2: added ticket link" [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [14:18:00] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Password reminders for lists.wikimedia.org/mailman don't seem to be working - https://phabricator.wikimedia.org/T249101 (10Reedy) 05Open→03Invalid https://grafana.wikimedia.org/d/nULM0E1Wk/mailman?orgId=1 Seems it just gets backlogged at the start of the... [14:18:13] (03CR) 10Dzahn: [C: 03+1] "adding the ticket number in this way made a bot add this comment: https://phabricator.wikimedia.org/T249081#6018848" [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [14:18:35] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2042.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2042.codfw.wmnet'] ` [14:22:36] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21659/" [puppet] - 10https://gerrit.wikimedia.org/r/585238 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:27:20] (03PS3) 10Dzahn: DHCP: remove old install servers and use new servers as next-server [puppet] - 10https://gerrit.wikimedia.org/r/569686 (https://phabricator.wikimedia.org/T224576) [14:29:29] (03CR) 10Dzahn: [C: 03+2] DHCP: remove old install servers and use new servers as next-server [puppet] - 10https://gerrit.wikimedia.org/r/569686 
(https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:30:16] !log pool cp2042 - T248816 [14:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:23] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [14:32:17] !log depooling wdqs1006 to allow catching up on lag [14:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:05] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "Why not? I don't have to prove the usefulness of this feature." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [14:34:07] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Vgutierrez) [14:34:13] 10Operations, 10ops-codfw, 10DC-Ops: (Need by:TBD) rack/setup/install backup2002 - https://phabricator.wikimedia.org/T249116 (10RobH) [14:34:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [14:34:47] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [14:35:25] 10Operations, 10ops-codfw, 10DC-Ops: (Need by:TBD) rack/setup/install backup2002 - https://phabricator.wikimedia.org/T249116 (10RobH) 05Open→03Resolved this is handled by https://phabricator.wikimedia.org/T248934 and not linked to parent task [14:35:26] (03CR) 10Dzahn: [C: 03+1] "this needs a change in puppet repo as well! 
hieradata/common.yaml:aptrepo_server: install1002.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/575404 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:35:38] 10Operations, 10ops-codfw, 10DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10RobH) [14:36:49] (03CR) 10Hashar: "recheck T249076" [puppet] - 10https://gerrit.wikimedia.org/r/585058 (owner: 10Andrew Bogott) [14:37:39] (03CR) 10Ayounsi: [C: 03+1] "Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [14:39:55] (03PS1) 10Vgutierrez: site: decommission cp20[18,20,22,24-26] [puppet] - 10https://gerrit.wikimedia.org/r/585244 (https://phabricator.wikimedia.org/T249115) [14:41:51] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10ItamarWMDE) @elukey Thank you, am able to access hue now :) [14:46:41] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:46:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:49] (03CR) 10CDanis: [C: 03+2] completed rollout of sensible flow-table-sizes [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [14:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:01] (03Merged) 10jenkins-bot: completed rollout of sensible flow-table-sizes [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [14:47:56] (03CR) 10Ppchelko: "Also, needs a manual rebase" [deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [14:49:18] !log vgutierrez@cumin1001 START - 
Cookbook sre.hosts.decommission [14:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:57] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:04] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2018.codfw.wmnet` - cp2018.codfw.wmne... [14:50:09] (03PS3) 10Andrew Bogott: neutron: enable l3_agent_only_dmz_cidr_hack in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/585031 (https://phabricator.wikimedia.org/T247505) [14:50:11] (03PS5) 10Andrew Bogott: Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - 10https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) [14:50:31] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:42] (03PS1) 10Dzahn: hiera/apt.wikimedia.org: switch from install1002 to apt1001 [puppet] - 10https://gerrit.wikimedia.org/r/585245 (https://phabricator.wikimedia.org/T224576) [14:51:03] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [14:51:19] (03PS2) 10Ayounsi: Manage static flowspec rules via Homer [homer/public] - 10https://gerrit.wikimedia.org/r/585232 [14:51:30] (03CR) 10jerkins-bot: [V: 04-1] Manage static flowspec rules via Homer [homer/public] - 10https://gerrit.wikimedia.org/r/585232 (owner: 10Ayounsi) [14:51:37] 
PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [14:52:19] (03CR) 10Dzahn: [C: 03+1] "also merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/585245" [dns] - 10https://gerrit.wikimedia.org/r/575404 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:52:56] (03PS3) 10Ayounsi: Manage static flowspec rules via Homer [homer/public] - 10https://gerrit.wikimedia.org/r/585232 [14:52:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:10] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp[2020,2022,2024-2026].codfw.wmnet` -... [14:55:17] 10Operations, 10serviceops: CORS errors on commons on debug servers - https://phabricator.wikimedia.org/T249107 (10Tgr) MediaViewer is also broken, due to a slightly different error: ` Request header field x-wikimedia-debug is not allowed by Access-Control-Allow-Headers in preflight response. 
` [14:57:22] (03PS1) 10Dzahn: site: fix comment about public/private IPs of apt repo [puppet] - 10https://gerrit.wikimedia.org/r/585247 [14:58:57] (03PS2) 10Dzahn: site: fix comment about public/private IPs of apt repo [puppet] - 10https://gerrit.wikimedia.org/r/585247 [14:59:03] (03CR) 10Ayounsi: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/584972 (owner: 10Volans) [14:59:38] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/5 { ... } + member xe-2/0/3; [edit interfaces] - xe-2/0/3 { - description cp200... [14:59:55] (03CR) 10Vgutierrez: [C: 03+2] site: decommission cp20[18,20,22,24-26] [puppet] - 10https://gerrit.wikimedia.org/r/585244 (https://phabricator.wikimedia.org/T249115) (owner: 10Vgutierrez) [15:00:05] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10Papaul) [15:00:21] (03CR) 10CDanis: [C: 03+1] "lg, nits" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/585232 (owner: 10Ayounsi) [15:01:17] (03CR) 10Volans: [C: 03+2] config: complete test coverage [software/homer] - 10https://gerrit.wikimedia.org/r/584972 (owner: 10Volans) [15:03:09] (03PS1) 10Jbond: profile::maps::tlsproxy: update profile to use envoy for tls termination [puppet] - 10https://gerrit.wikimedia.org/r/585248 [15:03:52] (03PS1) 10Vgutierrez: Remove cp[2018,2020,2022,2024-2026] entries [dns] - 10https://gerrit.wikimedia.org/r/585249 (https://phabricator.wikimedia.org/T249115) [15:03:54] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/585248 (owner: 10Jbond) [15:04:05] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Papaul) ` [edit interfaces interface-range disabled] member 
xe-2/0/3 { ... } + member xe-7/0/3; [edit interfaces] - xe-7/0/3 { - description cp201... [15:04:07] (03PS2) 10Jbond: profile::maps::tlsproxy: update profile to use envoy for tls termination [puppet] - 10https://gerrit.wikimedia.org/r/585248 [15:04:26] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Papaul) [15:04:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:05:05] (03Merged) 10jenkins-bot: config: complete test coverage [software/homer] - 10https://gerrit.wikimedia.org/r/584972 (owner: 10Volans) [15:06:31] (03CR) 10Muehlenhoff: admin: Add user jayme (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [15:07:14] (03PS13) 10RLazarus: profile::mediawiki::maintenance: Migrate pagetriage jobs to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/582933 (https://phabricator.wikimedia.org/T211250) [15:07:16] (03PS1) 10RLazarus: profile::mediawiki::maintenance: Migrate translationnotifications jobs to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585250 [15:08:02] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:08:13] (03CR) 10jerkins-bot: [V: 04-1] profile::maps::tlsproxy: update profile to use envoy for tls termination [puppet] - 10https://gerrit.wikimedia.org/r/585248 (owner: 10Jbond) [15:08:23] (03CR) 10Jbond: [C: 03+1] Manage static flowspec rules via Homer [homer/public] - 10https://gerrit.wikimedia.org/r/585232 (owner: 10Ayounsi) [15:08:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: 
job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:10:51] (03CR) 10Acamicamacaraca: [C: 03+1] "What the hell are we talking about here? The Serbian Wikipedia community has decided to be this way. Who are we to challenge?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [15:11:49] !log performing kafka-main rolling restarts to pick up security updates [15:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:20] (03CR) 10RLazarus: "> Patch Set 12: Code-Review-1" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/582933 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [15:13:00] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::maintenance: Migrate translationnotifications jobs to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585250 (owner: 10RLazarus) [15:13:02] (03CR) 10JMeybohm: admin: Add user jayme (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [15:14:23] jerkins could you don't >:| [15:15:05] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23,27].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Vgutierrez) [15:15:37] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [15:15:40] (03PS2) 10RLazarus: maintenance: Migrate translationnotifications jobs to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/585250 [15:15:46] (03PS3) 10Jbond: profile::maps::tlsproxy: update profile to use envoy for tls termination [puppet] - 10https://gerrit.wikimedia.org/r/585248 [15:16:27] (03PS3) 10RLazarus: maintenance: Migrate translationnotifications jobs to periodic_job [puppet] - 
10https://gerrit.wikimedia.org/r/585250 [15:18:27] (03PS1) 10Gergő Tisza: Whitelist X-Wikimedia-Debug header for CORS media requests [puppet] - 10https://gerrit.wikimedia.org/r/585252 (https://phabricator.wikimedia.org/T249107) [15:20:11] (03PS1) 10Vgutierrez: site,install_server: Decommission cp20[16,19,23,27] [puppet] - 10https://gerrit.wikimedia.org/r/585254 (https://phabricator.wikimedia.org/T249125) [15:20:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:21:58] (03PS1) 10Elukey: Enable TLS encryption to Kafka Jumbo for all pmacct instances [puppet] - 10https://gerrit.wikimedia.org/r/585255 (https://phabricator.wikimedia.org/T248980) [15:22:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3318 after schema change', diff saved to https://phabricator.wikimedia.org/P10845 and previous config saved to /var/cache/conftool/dbconfig/20200401-152258-marostegui.json [15:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:27:40] !log depool & decommission cp20[16,19,23,27] - T249125 [15:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:46] T249125: decommission cp20[16,19,23,27].codfw.wmnet - https://phabricator.wikimedia.org/T249125 [15:28:21] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp20[16,19,23,27] [puppet] - 10https://gerrit.wikimedia.org/r/585254 (https://phabricator.wikimedia.org/T249125) (owner: 10Vgutierrez) [15:29:06] (03Abandoned) 10Thcipriani: Gerrit: apache proxy not pooled [puppet] - 10https://gerrit.wikimedia.org/r/579601 (https://phabricator.wikimedia.org/T246763) (owner: 10Thcipriani) [15:29:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime 
[15:29:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:05] (03CR) 10Muehlenhoff: [C: 03+2] admin: Add user jayme [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [15:31:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [15:31:22] !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [15:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:23] (03CR) 10Muehlenhoff: [C: 03+2] "Key was validated over the phone" [puppet] - 10https://gerrit.wikimedia.org/r/585213 (https://phabricator.wikimedia.org/T249081) (owner: 10JMeybohm) [15:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [15:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:31] (03PS1) 10Ema: ATS: increase check_trafficserver_log_fifo timeout [puppet] - 10https://gerrit.wikimedia.org/r/585256 (https://phabricator.wikimedia.org/T248067) [15:33:10] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:33:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:38] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/21660/" [puppet] - 10https://gerrit.wikimedia.org/r/585255 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [15:33:44] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission 
cp20[16,19,23,27].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp[2016,2019,2023,2027].codfw.wmnet` - cp2... [15:34:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:35:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:36:22] (03CR) 10Vgutierrez: [C: 03+1] ATS: increase check_trafficserver_log_fifo timeout [puppet] - 10https://gerrit.wikimedia.org/r/585256 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [15:36:47] (03CR) 10Ayounsi: [C: 03+1] "LGTM once we confirm that it's not crashing on netflow4001." [puppet] - 10https://gerrit.wikimedia.org/r/585255 (https://phabricator.wikimedia.org/T248980) (owner: 10Elukey) [15:37:02] (03PS12) 10Mstyles: kibana: move httpd proxy authentication to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [15:39:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:40:45] (03PS1) 10Ssingh: Allow remote connections to the Postgres database for OONI tests [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/585258 [15:41:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:41:44] (03PS1) 10Vgutierrez: Remove cp20[16,19,23,27] entries [dns] - 10https://gerrit.wikimedia.org/r/585259 (https://phabricator.wikimedia.org/T249125) [15:41:53] 10Operations, 10Patch-For-Review: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 
(10MoritzMuehlenhoff) [15:42:27] 10Operations, 10netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) 05Open→03Resolved Homer-ized and done. [15:43:04] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/3 { ... } + member xe-7/0/4; [edit interfaces] - xe-7/0/4 { - description cp201... [15:43:12] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10RobH) [15:43:51] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (10Papaul) [15:44:22] (03CR) 10Vgutierrez: [C: 03+2] Remove cp20[16,19,23,27] entries [dns] - 10https://gerrit.wikimedia.org/r/585259 (https://phabricator.wikimedia.org/T249125) (owner: 10Vgutierrez) [15:45:10] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/4 { ... } + member xe-7/0/5; [edit interfaces] - xe-7/0/5 { - description cp2012; - e... 
[15:46:58] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23,27].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:47:52] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [15:48:01] \o/ [15:48:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:49:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:51:06] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (10Papaul) [15:51:26] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Vgutierrez) [15:52:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:52:01] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) 05Resolved→03Open [15:52:05] sigh... [15:54:25] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (10RobH) [15:54:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:57:13] (03PS1) 10Vgutierrez: Restore cp2027 entries [dns] - 10https://gerrit.wikimedia.org/r/585262 (https://phabricator.wikimedia.org/T248816) [15:57:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:08] (03PS2) 10Vgutierrez: Restore cp2027 entries [dns] - 10https://gerrit.wikimedia.org/r/585262 (https://phabricator.wikimedia.org/T248816) [15:58:52] (03CR) 10Vgutierrez: [C: 03+2] Restore cp2027 entries [dns] - 10https://gerrit.wikimedia.org/r/585262 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [15:59:07] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/46 { ... } + member xe-2/0/3; [edit interfaces] - xe-2/0/3 { - description cp2013; -... [15:59:39] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10Papaul) [16:00:53] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/3 { ... } + member xe-2/0/4; [edit interfaces] - xe-2/0/4 { - description cp201... 
[16:01:37] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10Papaul) [16:03:07] (03PS1) 10Vgutierrez: site,install_server: Restore cp2027 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/585263 (https://phabricator.wikimedia.org/T248816) [16:03:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:04:42] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Restore cp2027 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/585263 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [16:06:00] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/4 { ... } + member xe-7/0/4; [edit interfaces] - xe-7/0/4 { - description cp2017; - e... [16:06:29] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10Papaul) [16:06:56] (03PS1) 10ArielGlenn: fix up check for page range of prefetch files [dumps] - 10https://gerrit.wikimedia.org/r/585264 [16:07:48] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2027.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... 
[16:09:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:56] (03CR) 10ArielGlenn: [C: 03+2] fix up check for page range of prefetch files [dumps] - 10https://gerrit.wikimedia.org/r/585264 (owner: 10ArielGlenn) [16:17:28] !log ariel@deploy1001 Started deploy [dumps/dumps@21363c1]: page range prefetch fixup [16:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:38] !log ariel@deploy1001 Finished deploy [dumps/dumps@21363c1]: page range prefetch fixup (duration: 00m 09s) [16:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:04] 10Operations: Onboarding Janis Meybohm - https://phabricator.wikimedia.org/T249081 (10MoritzMuehlenhoff) [16:21:28] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:21:44] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10thcipriani) [16:23:10] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:23:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:24:27] 10Operations, 
10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (10elukey) p:05Triage→03High [16:25:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:25:40] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:29] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:30:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:31:01] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:13] (03PS1) 10CDanis: enable sampling on eqdfw & eqord [homer/public] - 10https://gerrit.wikimedia.org/r/585269 [16:37:31] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1040 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:37:45] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2027.codfw.wmnet'] ` and were **ALL** successful. [16:39:08] !log pool cp2027 - T248816 [16:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:14] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [16:40:17] (03CR) 10Ayounsi: [C: 03+1] enable sampling on eqdfw & eqord [homer/public] - 10https://gerrit.wikimedia.org/r/585269 (owner: 10CDanis) [16:40:29] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [16:40:39] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) 05Open→03Resolved [16:41:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:42:35] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:48:27] (03CR) 10CDanis: [C: 03+2] enable sampling on eqdfw & eqord [homer/public] - 10https://gerrit.wikimedia.org/r/585269 (owner: 10CDanis) [16:48:53] (03Merged) 10jenkins-bot: enable sampling on eqdfw & eqord [homer/public] - 10https://gerrit.wikimedia.org/r/585269 (owner: 10CDanis) [16:49:13] 10Operations, 10LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10jcrespo) 05Open→03Resolved a:03jcrespo I am going to be bold, @WMDE-leszek and close this ticket as resolved. Everything that SREs could do was done (LDAP and shell access handli... [16:53:24] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:54:30] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudvirt2001-dev.codfw.wmnet, cloudvirt2002-dev.codfw.wmnet, cloudnet2003-dev.codfw.wmnet, cloudnet2002-dev.codfw.wmnet, cloudvirt2003-dev.codfw.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [16:54:54] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕐☕ homer 'cr*eqdfw*' commit 'enable sampling on eqdfw Iac15379cc' [16:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:05] volans: isn't homer supposed to !log ? [16:55:52] cdanis: no, it's supposed to be integrated with a cookbook that will log *or* it will ! 
log when we'll do the wmfpylib stuff [16:55:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:55:56] that I would like to do anyway [16:56:01] ahah [16:56:18] volans: tbh seems like we have some good opportunities for wmfpylib soon [16:56:24] (03CR) 10Elukey: kibana: move httpd proxy authentication to a separate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [16:56:43] yeah I'd like to start this Q if the world doesn't end [16:57:20] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10RobH) [16:58:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:59:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10RobH) [16:59:28] 10Operations, 10ops-eqiad, 10decommission: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10RobH) [17:00:20] 10Operations, 10ops-eqiad, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10RobH) [17:01:20] 10Operations, 10ops-eqiad, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10RobH) [17:01:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:02:16] 10Operations, 10ops-eqiad, 10decommission: Reclaim torrelay1001 to spares - https://phabricator.wikimedia.org/T243390 (10RobH) [17:02:46] 10Operations, 10ops-eqiad, 10decommission, 10User-jbond: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10RobH) [17:03:50] 10Operations, 10ops-eqiad, 10decommission: Reclaim labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10RobH) [17:04:19] 10Operations, 10ops-eqiad, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10RobH) [17:04:42] 10Operations, 10ops-eqiad, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10RobH) [17:05:06] 10Operations, 10ops-eqiad, 10decommission: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10RobH) [17:05:37] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission: Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10RobH) [17:05:58] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10RobH) [17:06:00] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/4 { ... } + member xe-7/0/3; [edit interfaces] - xe-7/0/3 { - description cp2016; -... 
[17:06:24] 10Operations, 10ops-eqiad, 10decommission, 10cloud-services-team (Kanban): labsdb1002-array1: status clarification - https://phabricator.wikimedia.org/T214903 (10RobH) [17:06:42] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Hardware): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) [17:06:56] (03CR) 10Jforrester: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/585058 (owner: 10Andrew Bogott) [17:07:18] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul) [17:07:25] (03CR) 10Jforrester: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/584020 (owner: 10RLazarus) [17:07:29] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10decommission, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) [17:07:55] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/585058 (owner: 10Andrew Bogott) [17:08:02] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10RobH) [17:09:58] 10Operations, 10Wikimedia-Mailing-lists: Creation of three Wikimedia CH mailing lists - https://phabricator.wikimedia.org/T248910 (10Quiddity) @Ilario Per https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list please add to the description above: * (x3) description of the list for the list info page (... 
[17:10:10] 10Operations, 10ops-codfw, 10Analytics, 10decommission, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10RobH) [17:11:44] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10RobH) [17:12:31] (03Abandoned) 10Hashar: DO NOT MERGE -- experimental no-op tox test [puppet] - 10https://gerrit.wikimedia.org/r/585058 (owner: 10Andrew Bogott) [17:19:09] 10Operations, 10Release-Engineering-Team-TODO: Should 'doc' machines (i.e. doc1001) have contint-roots as a group? - https://phabricator.wikimedia.org/T245691 (10thcipriani) a:03thcipriani [17:21:39] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕐☕ homer 'cr*eqord*' commit 'enable sampling on eqord Iac15379cc' [17:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:26] 10Operations, 10DC-Ops, 10decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (10RobH) [17:24:52] 10Operations, 10ops-codfw, 10decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (10RobH) [17:25:15] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (10RobH) [17:25:34] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10RobH) [17:25:54] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (10RobH) [17:26:30] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (10RobH) [17:26:52] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10RobH) [17:27:02] 10Operations, 10ops-codfw, 
10DC-Ops, 10Traffic, 10decommission: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10RobH) [17:27:30] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10RobH) [17:27:39] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10RobH) [17:27:55] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10RobH) [17:28:14] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (10RobH) [17:28:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10RobH) [17:28:46] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (10RobH) [17:29:04] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10RobH) [17:30:11] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10RobH) [17:30:27] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10RobH) [17:30:46] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (10RobH) [17:31:32] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) @jbond Thank you, I've created an admin account for you and sent you a password reset email. Please sign in at console.jumpcloud.com... 
[17:31:34] 10Operations, 10ops-eqiad, 10decommission: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10RobH) [17:33:23] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Hardware): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) [17:33:54] 10Operations, 10ops-eqiad, 10decommission: Reclaim labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10RobH) [17:34:03] 10Operations, 10ops-eqiad, 10decommission: decommission elastic1017 - https://phabricator.wikimedia.org/T234045 (10RobH) [17:34:47] 10Operations, 10ops-eqiad, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10RobH) [17:35:07] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission: Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10RobH) [17:36:10] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10decommission, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) [17:37:05] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH) [17:37:20] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10RobH) [17:38:24] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10SNowick_WMF) @jcrespo I don't need this access any longer, I can access a replica of wikishared and see the same tables. I initially had some permission issues on our not... 
[17:38:37] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10RobH) [17:39:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:39:09] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10RobH) [17:39:21] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10SNowick_WMF) 05Open→03Resolved [17:39:35] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10RobH) [17:39:58] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10RobH) [17:40:35] 10Operations, 10ops-eqiad, 10decommission: Reclaim labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10RobH) [17:41:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10RobH) [17:41:21] 10Operations, 10ops-eqiad, 10decommission: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10RobH) [17:42:25] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) [17:46:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:47:17] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10RobH) [17:50:14] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [17:52:18] 10Operations, 10Fundraising-Backlog, 10MediaWiki-Vagrant: Package XDebug 2.7.2 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10hashar) [17:57:42] 10Operations, 10Toolforge: Restart Copyvios - https://phabricator.wikimedia.org/T249147 (10doctaxon) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200401T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:02:10] 10Operations, 10Cloud-Services: Restart Copyvios - https://phabricator.wikimedia.org/T249147 (10doctaxon) [18:03:03] 10Operations, 10Toolforge: Restart Copyvios - https://phabricator.wikimedia.org/T249147 (10doctaxon) [18:03:21] 10Operations, 10Cloud-Services: Restart Copyvios - https://phabricator.wikimedia.org/T249147 (10doctaxon) [18:04:57] 10Operations, 10Fundraising-Backlog, 10MediaWiki-Vagrant: Package XDebug 2.7.2 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10hashar) I have added a couple tasks that are related to Xdebug 2.7.0 having bug. For T234418, I have identified the upstream patch and got it added to our Debia... 
[18:06:50] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:08:28] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 87, down: 0, dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:10:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:12:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:13:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 87, down: 0, dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:38] (03PS2) 10Ppchelko: changeprop: Make service features toggles rather than comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [18:25:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:25:31] (03PS3) 10Ppchelko: changeprop: Make service features toggles rather than comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [18:26:46] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:27:02] (03CR) 10Ppchelko: [C: 03+2] "PS2 and 3 are manual rebase." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [18:27:22] (03Merged) 10jenkins-bot: changeprop: Make service features toggles rather than comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [18:33:46] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests: Grant analytics access to Anti Harassment Tools engineers - https://phabricator.wikimedia.org/T249059 (10aezell) @Mooeypoo If you create a new one with the format suggested above, I will provide approval. [18:39:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:41:40] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:43:24] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:44:13] (03PS1) 10Joal: Add imagelinks table to tables sqooped on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/585292 (https://phabricator.wikimedia.org/T249113) [18:47:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:48:27] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10AMooney) a:05nnikkhoui→03None [18:51:31] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests: Grant analytics access to Anti Harassment Tools engineers - https://phabricator.wikimedia.org/T249059 (10jcrespo) Just to be clear, a new ticket is not required- it can be done on this same ticket, as I requested, and Gilles did here: T248797#6013982.... [18:53:47] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10AMooney) a:03Peter.ovchyn Peter, Can you take a look at this... [19:00:04] dduvall and longma: That opportune time is upon us again. Time for a Mediawiki train - American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200401T1900). [19:07:43] 10Operations, 10Wikimedia-Mailing-lists: Creation of three Wikimedia CH mailing lists - https://phabricator.wikimedia.org/T248910 (10Ilario) @Quiddity here the descriptions: wikimediach-fr - Open mailing list to coordinate activities in Switzerland in French / liste de diffusion ouverte pour la coordinat... 
[19:10:49] longma: o/ [19:11:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:11:23] * longma waves [19:12:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:16:37] (03PS1) 10Dduvall: group1 wikis to 1.35.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585300 [19:16:39] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.35.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585300 (owner: 10Dduvall) [19:17:36] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585300 (owner: 10Dduvall) [19:18:02] !log promoting group1 to 1.35.0-wmf.26 to group1 [19:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:04] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.26 [19:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:11] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.26 (duration: 01m 06s) [19:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:48] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:50] seeing quite a few jsonTruncated errors with message `%{message}` [19:25:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:27:03] rolling back [19:27:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:28:01] (03PS1) 10Dduvall: Revert "group1 wikis to 1.35.0-wmf.26" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585303 [19:29:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:30:05] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: rollback 1.35.0-wmf.26 from group1 [19:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:12] Fun times. 
[19:30:39] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.35.0-wmf.26" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585303 (owner: 10Dduvall) [19:31:00] so fun [19:31:30] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:31:40] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.26" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585303 (owner: 10Dduvall) [19:32:13] If it was seamless, we’d be bored. :) [19:32:14] (03PS1) 10Niedzielski: [prod] [beta] [Vector] remove outdated config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585304 [19:33:18] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:35:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:36:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:36:40] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:40:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:41:14] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:42:45] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10nnikkhoui) Talked with @Anomie and he brought up a good point... [19:48:57] !log rollback of 1.35.0-wmf.26 from group1 (T247773). 
blocked by T249162
[19:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:04] T249162: High rate of timeouts on jsonTruncated channel upon group1 1.35.0-wmf.26 promotion - https://phabricator.wikimedia.org/T249162
[19:49:04] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773
[19:50:55] Operations, Anti-Harassment, SRE-Access-Requests: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (Tchanders)
[19:51:21] Operations, MediaWiki-General, observability, serviceops, Patch-For-Review: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (colewhite)
[19:52:44] Operations, Anti-Harassment, SRE-Access-Requests: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (Tchanders) Thanks @jcrespo, I've updated this task according to the template. @dbarratt I think you have production...
[20:00:04] halfak and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200401T2000).
[20:04:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:09:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:09:56] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:11:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:13:22] (CR) Nuria: [C: +1] Add imagelinks table to tables sqooped on HDFS [puppet] - https://gerrit.wikimedia.org/r/585292 (https://phabricator.wikimedia.org/T249113) (owner: Joal)
[20:17:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:19:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:22:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:24:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:57:11] (PS5) Hashar: jenkins: Adjust CSP header to allow inline CSS and video playback [puppet] - https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658)
[20:57:24] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) (owner: Hashar)
[20:58:28] (CR) Hashar: "My commit message is outdated, I am not using it against a ssh-agent socket but against a credential file jenkins populates." [puppet] - https://gerrit.wikimedia.org/r/583392 (https://phabricator.wikimedia.org/T210271) (owner: Hashar)
[21:32:08] Operations, ops-esams, Traffic: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (faidon)
[21:32:11] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (faidon)
[21:32:39] Operations, Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (faidon)
[21:32:43] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (faidon)
[21:32:49] Operations, Traffic: cp3051 crashed - https://phabricator.wikimedia.org/T241306 (faidon)
[21:32:51] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (faidon)
[21:34:00] (CR) Hashar: "The cron jobs work fine and I have confirmed on integration-agent-docker-1010 the repositories got pruned properly." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/579602 (owner: Thcipriani)
[21:36:36] PROBLEM - Host ms-be1023 is DOWN: PING CRITICAL - Packet loss = 100%
[21:40:04] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (faidon) What's the latest here?
I haven't heard about these crashes lately but it may just be that I missed it. Do we know more about this now? Also, it's great to hear that we have a traceback no...
[21:40:32] Operations, ops-esams, DC-Ops, Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (faidon)
[21:40:38] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (faidon)
[21:40:44] ms-be1023 is unresponsive to ping, ssh and nothing in console, opening a task and force-powercycling it
[21:41:01] Operations, ops-esams, Traffic: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (faidon)
[21:41:06] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (faidon)
[21:41:20] Operations, ops-esams, Traffic: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (faidon)
[21:41:22] Operations, ops-esams, DC-Ops, Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (faidon)
[21:43:36] PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 298 MB (3% inode=89%): /tmp 298 MB (3% inode=89%): /var/tmp 298 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[21:45:32] kafkatee is spamming /var/log/messages in netflow2001
[21:45:42] the kafkatee systemd unit is masked
[21:46:05] and loop-restart-fail-to-be-restarted
[21:46:11] with OSError: [Errno 98] Address already in use
[21:46:22] from /usr/local/bin/rpkicounter.py
[21:49:26] ah, the kafkatee.service is masked, while the kafkatee-webrequest.service is the one failing
[21:50:55] !log stopped and restarted kafkatee-webrequest.service on netflow2001, was in a restart loop
[21:50:59] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:11] Operations, ops-eqiad, SRE-swift-storage: ms-be1023 crashed - https://phabricator.wikimedia.org/T249174 (Volans) p:Triage→Medium
[21:53:14] !log force-rebooting ms-be1023, unresponsive - T249174
[21:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:20] T249174: ms-be1023 crashed - https://phabricator.wikimedia.org/T249174
[21:56:02] RECOVERY - Host ms-be1023 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[22:02:26] !log forcing logrotate on netflow2001 to compress yesterday's logs
[22:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:35] (CR) Hashar: "You probably want to split the xdebug and tideway_xhprof in two separate ini files?" (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/584733 (https://phabricator.wikimedia.org/T246921) (owner: Jeena Huneidi)
[22:04:28] RECOVERY - Disk space on netflow2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[22:05:11] Operations, ops-eqiad, SRE-swift-storage: ms-be1023 crashed - https://phabricator.wikimedia.org/T249174 (Volans) Upon forced reboot the host is back up but Icinga is reporting `Cache: Permanently Disabled - Battery count: 0` and the iLO logged an additional message: ` hpiLO-> show /system1/log1/re...
[22:06:05] Operations, netops: netflow2001 kafkatee-webrequest restart loop - https://phabricator.wikimedia.org/T249176 (Volans) p:Triage→Medium
[22:07:44] Operations, netops: netflow2001 kafkatee-webrequest restart loop - https://phabricator.wikimedia.org/T249176 (Volans) I've then stopped and restarted the systemd unit and it was able to start properly, but it should be investigated. My understanding is that the systemd Icinga check was already alarming f...
[22:09:09] (PS4) Thcipriani: Integration Cluster: update gitcache nightly [puppet] - https://gerrit.wikimedia.org/r/579602
[22:09:13] (CR) Thcipriani: Integration Cluster: update gitcache nightly (4 comments) [puppet] - https://gerrit.wikimedia.org/r/579602 (owner: Thcipriani)
[22:13:22] Operations, netops: netflow hosts spamming /var/log - https://phabricator.wikimedia.org/T249177 (Volans) p:Triage→Medium
[22:15:34] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:15:36] ACKNOWLEDGEMENT - HP RAID on ms-be1023 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T249178 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[22:15:40] Operations, ops-eqiad: Degraded RAID on ms-be1023 - https://phabricator.wikimedia.org/T249178 (ops-monitoring-bot)
[22:16:54] Operations, ops-eqiad: Degraded RAID on ms-be1023 - https://phabricator.wikimedia.org/T249178 (Volans) p:Triage→Medium
[22:17:16] Operations, ops-eqiad: Degraded RAID on ms-be1023 - https://phabricator.wikimedia.org/T249178 (Volans)
[22:17:18] Operations, ops-eqiad, SRE-swift-storage: ms-be1023 crashed - https://phabricator.wikimedia.org/T249174 (Volans)
[22:25:18] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (LDickinsonWMF) Thanks, everyone!
Sorry it took me forever to reply and say that :) I appreciate your help.
[22:31:30] (CR) Hashar: "I have cherry picked PS 4 on the integration puppet master :]" [puppet] - https://gerrit.wikimedia.org/r/579602 (owner: Thcipriani)
[22:38:53] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:40:33] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:43:55] !log volker-e@deploy1001 Started deploy [design/style-guide@4bfe647]: Deploy design/style-guide:
[22:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:04] !log volker-e@deploy1001 Finished deploy [design/style-guide@4bfe647]: Deploy design/style-guide: (duration: 00m 08s)
[22:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:53] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:52:33] Operations, Traffic: Servers freezing across the caching cluster (November 2019) - https://phabricator.wikimedia.org/T238305 (Krinkle)
[22:55:13] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:55:19] RECOVERY - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[23:00:04] RoanKattouw,
Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200401T2300).
[23:00:04] DannyS712: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:31:39] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[23:32:19] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[23:50:41] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[23:51:49] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[23:54:55] Operations, Analytics, decommission, serviceops: decommission kraz.wikimedia.org -
https://phabricator.wikimedia.org/T245279 (Papaul)
[23:57:38] Operations, ops-codfw, DC-Ops, decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (Papaul)