[00:00:32] (03PS3) 10Dzahn: add maintenance.discovery.wmnet and point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/539635 (https://phabricator.wikimedia.org/T210411) [00:01:32] (03CR) 10Dzahn: [C: 03+2] add maintenance.discovery.wmnet and point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/539635 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [00:02:35] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2037 [dns] - 10https://gerrit.wikimedia.org/r/539993 [00:03:45] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for db2037 [dns] - 10https://gerrit.wikimedia.org/r/539993 (owner: 10Papaul) [00:04:52] (03PS2) 10Dzahn: misc webservices: style cleanup and comment [dns] - 10https://gerrit.wikimedia.org/r/539992 [00:12:05] !log phabricator - upgrading PHP version to 7.2.22 - T230024 [00:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:09] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [00:19:51] 10Operations, 10serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (10Dzahn) Thank you @MoritzMuehlenhoff and @jijiki first done on 2001 now also done on 1001. upgrade command: ` sudo cumin phab1003.eqiad.wmnet 'export DEBIAN_FRONTEND=noninteractive; apt-get install... 
[00:23:05] (03PS3) 10Dzahn: misc webservices: style cleanup and comment [dns] - 10https://gerrit.wikimedia.org/r/539992 [00:24:08] (03CR) 10Dzahn: [C: 03+2] misc webservices: style cleanup and comment [dns] - 10https://gerrit.wikimedia.org/r/539992 (owner: 10Dzahn) [00:25:32] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for db2037 [dns] - 10https://gerrit.wikimedia.org/r/539993 (owner: 10Papaul) [00:25:47] RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [00:26:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Dzahn) self-healing?? <+icinga-wm> RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 [00:27:47] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for db2037 [dns] - 10https://gerrit.wikimedia.org/r/539993 (owner: 10Papaul) [00:30:41] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [00:31:09] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (10Papaul) [00:31:26] (03Abandoned) 10Dzahn: ganeti/icinga: allow 3 ganeti-noded processes before alerting [puppet] - 10https://gerrit.wikimedia.org/r/532657 (owner: 10Dzahn) [00:31:33] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [00:31:42] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (10Papaul) 05Open→03Resolved This is complete [00:33:14] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) [00:33:57] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for db2036 [dns] - 10https://gerrit.wikimedia.org/r/539983 (owner: 10Papaul) [00:34:23] (03CR) 10jerkins-bot: [V: 04-1] DNS: Remove mgmt DNS for db2036 [dns] - 10https://gerrit.wikimedia.org/r/539983 (owner: 10Papaul) [00:35:13] (03PS3) 10Dzahn: DNS: Remove mgmt DNS for db2036 [dns] - 10https://gerrit.wikimedia.org/r/539983 (owner: 10Papaul) [00:36:14] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for db2036 [dns] - 10https://gerrit.wikimedia.org/r/539983 (owner: 10Papaul) [00:37:53] (03PS8) 10Dzahn: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [00:38:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:19] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) [00:39:23] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10Papaul) [00:39:53] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10Papaul) 05Open→03Resolved This is complete [00:44:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational 
but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:45:03] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2038 [dns] - 10https://gerrit.wikimedia.org/r/539995 [00:49:23] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2040 [dns] - 10https://gerrit.wikimedia.org/r/539996 [00:51:15] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for db2038 [dns] - 10https://gerrit.wikimedia.org/r/539995 (owner: 10Papaul) [00:52:10] (03CR) 10Dzahn: "wmnet file not added yet" [dns] - 10https://gerrit.wikimedia.org/r/539996 (owner: 10Papaul) [00:54:16] (03CR) 10Dzahn: [C: 04-1] "double check the alt names in the cert" [puppet] - 10https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [00:56:07] (03PS1) 10Dzahn: admins: create deployment shell user for Andrew Kostka [puppet] - 10https://gerrit.wikimedia.org/r/539997 [00:58:21] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/18694/gerrit1001.wikimedia.org/change.gerrit1001.wikimedia.org.pson" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [00:59:18] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "lgtm, let's merge it at a better time together" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [01:00:06] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Papaul) [01:04:39] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10Papaul) [01:06:39] (03PS2) 10Dzahn: admins: create deployment shell user for Andrew Kostka [puppet] - 10https://gerrit.wikimedia.org/r/539997 (https://phabricator.wikimedia.org/T233202) [01:07:24] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2038 [dns] - 10https://gerrit.wikimedia.org/r/539995 (owner: 10Papaul) 
[01:10:49] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Papaul) [01:11:27] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Papaul) [01:11:41] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Papaul) 05Open→03Resolved This is complete [01:17:22] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Dzahn) @Andrew-WMDE Alright, thanks. I made a code change at https://gerrit.wikimedia.org/r/c/operations/puppet/+/539997 and added reviewers. [01:25:03] (03PS1) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [01:34:41] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 970.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:39:33] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:42:16] (03PS11) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [01:42:18] (03PS8) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [01:42:20] (03PS3) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) 
[01:42:22] (03PS3) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [01:42:24] (03PS2) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [01:49:49] (03PS3) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [01:52:27] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:45] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:01:04] (03CR) 10Krinkle: [C: 03+1] scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T233828) (owner: 10Thcipriani) [02:05:00] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [02:05:25] 10Operations, 10Wikimedia-Logstash, 10observability, 10serviceops, 10Patch-For-Review: Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash - https://phabricator.wikimedia.org/T233828 (10Krinkle) >>! In T233828#5534006, @Krinkle wrote: >>>! In T233828#5532983, @Joe wrote: >>[…]... 
[02:05:32] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [02:05:46] 10Operations, 10Wikimedia-Logstash, 10observability, 10serviceops, 10Patch-For-Review: Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash - https://phabricator.wikimedia.org/T233828 (10Krinkle) 05Open→03Resolved a:03herron [02:06:15] (03PS4) 10Krinkle: scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [02:10:09] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:37] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:44:05] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:48:15] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 63408072 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:54:45] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 121880 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:05:23] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:09:10] (03PS9) 10CRusnov: Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) [03:21:01] (03CR) 10Ayounsi: [C: 03+1] "Not tested but logic and code LGTM." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov) [03:24:16] (03CR) 10CRusnov: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi) [03:37:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:33] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:40:28] (03CR) 10CRusnov: "> overall conceptually this sounds good, but I was doing some digging, and it seems as though /metrics/ is frustratingly behind authentica" [puppet] - 10https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi) [03:45:39] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units 
failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:54:04] (03PS1) 10Marostegui: db-eqiad.php: Depool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540006 (https://phabricator.wikimedia.org/T233698) [04:55:19] (03CR) 10Marostegui: [C: 04-2] "Do not depool before 10th October, as es1014 is in B1 and that host needs to be depooled the 10th of October for B1 PDU maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540006 (https://phabricator.wikimedia.org/T233698) (owner: 10Marostegui) [04:55:37] (03PS2) 10Marostegui: dump-misc.sh.erb: Remove puppet database from the backups [puppet] - 10https://gerrit.wikimedia.org/r/539840 (https://phabricator.wikimedia.org/T231539) [04:56:40] (03PS15) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [04:57:59] (03CR) 10CRusnov: backends: add Netbox backend (039 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [05:03:31] (03CR) 10jerkins-bot: [V: 04-1] backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [05:12:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [05:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:52] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2054.codfw.wmnet` - db2054.codfw.wmnet (**PASS**) - Downtimed host on Ic... 
[05:14:35] (03PS1) 10Marostegui: site.pp: Remove references to db2054 [puppet] - 10https://gerrit.wikimedia.org/r/540008 (https://phabricator.wikimedia.org/T232969) [05:15:51] (03PS1) 10Marostegui: wmnet: Remove production entries for db2054 [dns] - 10https://gerrit.wikimedia.org/r/540009 (https://phabricator.wikimedia.org/T232969) [05:15:56] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove references to db2054 [puppet] - 10https://gerrit.wikimedia.org/r/540008 (https://phabricator.wikimedia.org/T232969) (owner: 10Marostegui) [05:16:46] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production entries for db2054 [dns] - 10https://gerrit.wikimedia.org/r/540009 (https://phabricator.wikimedia.org/T232969) (owner: 10Marostegui) [05:17:59] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Marostegui) a:05RobH→03Papaul [05:18:21] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Marostegui) Host ready for @Papaul for on-site steps + switch disablement [05:21:43] 10Operations, 10ops-eqiad: rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10ArielGlenn) I'd like to request that both eth interfaces be cabled, as I'd like to try to set up bonding for this host. 
[05:54:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 56.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:54:41] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 43.49 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:54:57] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:57:37] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 102.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:57:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 84.81 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:04:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:05:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:19] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:05] RECOVERY - Check systemd state on an-presto1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:49] RECOVERY - Check systemd state on an-presto1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:09] RECOVERY - Check systemd state on an-presto1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:27] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Marostegui) [06:16:11] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:16:14] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Marostegui) @wiki_willy labsdb1009 has a broken PSU (T233273), I think the new one will arrive before this maintenance although it has not been ordered yet (T233277#5532768), but... 
[06:16:45] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Marostegui) [06:16:48] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Marostegui) [06:19:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2091:3314 schema change - T233625', diff saved to https://phabricator.wikimedia.org/P9223 and previous config saved to /var/cache/conftool/dbconfig/20191001-061956-marostegui.json [06:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:04] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [06:20:46] 10Operations, 10Analytics, 10Code-Stewardship-Reviews, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (10elukey) [06:28:22] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [06:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:36] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [06:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:40] 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics1032.eqiad.wmnet` - analytics1032.eqiad.wmnet (**FAIL**) - Downtimed host on Icinga - Downtimed man... 
[06:29:46] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [06:29:46] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [06:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:19] ok interesting case --^ [06:31:44] analytics1032 is stuck in boot so the script couldn't wipe the bootloader [06:31:58] then, I used the old mgmt pass and it failed to powerdown the host [06:32:13] retried with the new pass, but now it doesn't find the host anymore [06:38:38] 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) The host is stuck while booting, so the above script failed. I manually powered it off, but the clean up in puppet/debmonitor/etc.. should have been done anyway. [06:41:15] 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) [06:42:23] (03PS1) 10Elukey: Remove analytics1032 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/540017 (https://phabricator.wikimedia.org/T233080) [06:43:29] (03CR) 10Elukey: [C: 03+2] Remove analytics1032 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/540017 (https://phabricator.wikimedia.org/T233080) (owner: 10Elukey) [06:43:31] PROBLEM - ElasticSearch shard size check - 9200 on logstash2001 is CRITICAL: CRITICAL - logstash-2019.09.29(230gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:43:31] PROBLEM - ElasticSearch shard size check - 9200 on logstash2002 is CRITICAL: CRITICAL - logstash-2019.09.29(230gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:43:33] PROBLEM - ElasticSearch shard size check - 9200 on logstash2006 is CRITICAL: CRITICAL - logstash-2019.09.29(230gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:43:33] PROBLEM - ElasticSearch shard size check - 
9200 on logstash2005 is CRITICAL: CRITICAL - logstash-2019.09.29(230gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:43:33] PROBLEM - ElasticSearch shard size check - 9200 on logstash2004 is CRITICAL: CRITICAL - logstash-2019.09.29(230gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:43:33] PROBLEM - ElasticSearch shard size check - 9200 on logstash2003 is CRITICAL: CRITICAL - logstash-2019.09.29(230gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:43:49] gehel, onimisionipe --^ [06:44:33] also dcausse :) [06:44:43] :) [06:45:29] Why would a 2019.09.29 shard still grow, in other words still receive events? [06:48:53] 10Operations, 10decommission, 10Patch-For-Review: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) ` elukey@asw2-c-eqiad> show interfaces descriptions | match analytics1032 ge-3/0/12 up down analytics1032 - no-bw-mon ` [06:49:20] 10Operations, 10decommission, 10Patch-For-Review: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) [06:51:00] (03CR) 10Giuseppe Lavagetto: "Shouldn't we just remove that code at this point?" [puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [06:55:03] godog: shard size explosion on logstash (alert above), any idea what's happening? [06:55:46] there are a lot more events on that day [06:56:06] not sure why it's just alerting now [06:56:40] err no, a lot more events on 09/30 not 29 [06:58:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "It's a bit sad to have such an implementation detail in the code, but we can't do otherwise." 
[software/conftool] - 10https://gerrit.wikimedia.org/r/539859 (https://phabricator.wikimedia.org/T233679) (owner: 10CDanis) [07:00:25] !log gerrit: forcing reindex of changes # T233989 [07:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:29] T233989: Lots of "Skipping change xxx because the corresponding repository was not found" in the logs - https://phabricator.wikimedia.org/T233989 [07:03:49] (03PS1) 10Elukey: Remove analytics1032's prod DNS records [dns] - 10https://gerrit.wikimedia.org/r/540019 (https://phabricator.wikimedia.org/T233080) [07:04:04] (03CR) 10Giuseppe Lavagetto: "A couple smaller comments, and a larger one - we maintain this logstash query in exact carbon copy here and in the kibana dashboards defin" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [07:04:36] (03CR) 10Elukey: [C: 03+2] Remove analytics1032's prod DNS records [dns] - 10https://gerrit.wikimedia.org/r/540019 (https://phabricator.wikimedia.org/T233080) (owner: 10Elukey) [07:07:44] 10Operations, 10decommission, 10Patch-For-Review: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) [07:08:50] (03CR) 10Giuseppe Lavagetto: "> questions i have: has_lvs is disabled but webserver has_tls is" [puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [07:11:58] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I just realized has_lvs also affects parsoid, so sorry for my suggestion, it can't be applied." 
[puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [07:23:13] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10MoritzMuehlenhoff) 05Resolved→03Open There's two issues with the patch merged for Erin Yener: (1) I... [07:24:08] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) @Marostegui - sure, will do. This week is the approval & ordering phase of the procurement cycle, so it shouldn't be an issue getting the PO submitted for labsdb1009.... [07:24:43] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Marostegui) Excellent, thanks! [07:38:31] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:14] (03CR) 10Gilles: "It's a decent fallback if something else that is gzip-capable sends requests to MediaWiki." 
[puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [07:53:44] (03PS1) 10Daniel Kinzler: Labs: set MCR migration stage to SCHEMA_COMPAT_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540083 (https://phabricator.wikimedia.org/T198559) [08:00:36] !log draining ganeti2003 for upcoming reboot (combined kernel/qemu security updates) [08:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:49] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (10Gehel) The MIME types used are: * text/html * application/rdf+xml * application/n... [08:17:35] 10Operations, 10serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) 05Open→03Resolved 7.2.22 is rolled out fleet-wide to all servers using PHP 7.2 [08:17:38] 10Operations, 10serviceops: Remove PHP 7.0 from production application servers - https://phabricator.wikimedia.org/T220600 (10MoritzMuehlenhoff) [08:20:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks pending, can be merged once approved by Greg" [puppet] - 10https://gerrit.wikimedia.org/r/539997 (https://phabricator.wikimedia.org/T233202) (owner: 10Dzahn) [08:24:15] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:27:21] (03CR) 10Muehlenhoff: [C: 04-1] "The patch is technically correct, but I don't see any point in the class installing either libmariadb3 or libmariadbclient18: These are ju" [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [08:41:18] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:41:19] !log jmm@cumin2001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [08:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:27] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:57:54] !log upgrading python3-cryptography to version 2.6.1-3+deb10u1~wmf1 on acmechief-test1001 - T234131 [08:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:58] T234131: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 [09:03:31] !log rebalancing ganeti/row_B after rolling reboot [09:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:39] !log upgrading python3-cryptography to version 2.6.1-3+deb10u1~wmf1 on acme-chief hosts - T234131 [09:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:43] T234131: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 [09:07:09] !log restarting acme-chief on acmechief1001 to catch up with python3-cryptography upgrades - T234131 [09:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:17] (03PS1) 10Elukey: profile::kerberos: add missing service resource ensure [puppet] - 10https://gerrit.wikimedia.org/r/540087 [09:11:56] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10jbond) Yep seems reasonable to me [09:12:55] (03CR) 10Jbond: profile: sanity checks for cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [09:16:09] (03CR) 10Jcrespo: "I give Daniel and Moritz free range to decide any solution and removing myself from the equation as this does 
not affect production databa" [puppet] - https://gerrit.wikimedia.org/r/539973 (owner: Dzahn)
[09:16:55] (CR) MarcoAurelio: "> the link in the commit message links back to itself" [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[09:17:36] (CR) MarcoAurelio: "> > the link in the commit message links back to itself" [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[09:19:50] !log upgrading ferm on a number of systems to 2.4.2pre T153468
[09:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:54] T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468
[09:21:21] (CR) MarcoAurelio: "Using the full commit hash (10107cf5056eb6ef3903f49d59cd27387831c5b5) also leads to this commit." [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[09:22:31] (CR) MarcoAurelio: "> This is a safe change, all it does is changes the default group" [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[09:35:56] !log run systemctl reset-failed on puppetmaster2002 to clear failed puppet-master.service
[09:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:47] RECOVERY - Check systemd state on puppetmaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:07] thx moritzm
[09:42:18] (CR) Giuseppe Lavagetto: [C: -1] "I applaud the idea, but you should avoid using $::cluster directly here. See the inline comments."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: Filippo Giunchedi)
[10:01:21] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:04:14] (PS1) Elukey: profile::kerberos::replication: fix kprop acl file [puppet] - https://gerrit.wikimedia.org/r/540090
[10:07:56] (PS2) Elukey: profile::kerberos::replication: fix kprop acl file [puppet] - https://gerrit.wikimedia.org/r/540090
[10:10:21] (PS2) Elukey: profile::kerberos: add missing service resource ensure [puppet] - https://gerrit.wikimedia.org/r/540087
[10:10:23] (PS3) Elukey: profile::kerberos::replication: fix kprop acl file [puppet] - https://gerrit.wikimedia.org/r/540090
[10:11:57] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:12:40] (CR) jerkins-bot: [V: -1] profile::kerberos: add missing service resource ensure [puppet] - https://gerrit.wikimedia.org/r/540087 (owner: Elukey)
[10:13:36] hi all, I plan to upgrade puppetmaster2001 today. I'll disable puppet in codfw, ulsfo and eqsin while the reimage takes place. I plan to start at 11:00 UTC. please say if there are issues
[10:13:56] (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/540087 (owner: Elukey)
[10:15:13] jbond42: o/ maybe an email to ops@ will reach a bigger audience (to avoid people not reading in here etc..)
[10:16:12] (CR) jerkins-bot: [V: -1] profile::kerberos: add missing service resource ensure [puppet] - https://gerrit.wikimedia.org/r/540087 (owner: Elukey)
[10:16:56] elukey: ack, I'll send now
[10:17:15] !log upgrading ferm on remaining mw servers 2.4.2pre T153468
[10:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:19] T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468
[10:18:50] (PS3) Elukey: profile::kerberos: add missing service resource ensure [puppet] - https://gerrit.wikimedia.org/r/540087
[10:18:52] (PS4) Elukey: profile::kerberos::replication: fix kprop acl file [puppet] - https://gerrit.wikimedia.org/r/540090
[10:21:55] (CR) Elukey: [C: +2] profile::kerberos: add missing service resource ensure [puppet] - https://gerrit.wikimedia.org/r/540087 (owner: Elukey)
[10:22:12] (CR) Elukey: [C: +2] profile::kerberos::replication: fix kprop acl file [puppet] - https://gerrit.wikimedia.org/r/540090 (owner: Elukey)
[10:25:18] (PS1) Hashar: gerrit: install openjdk dbg package [puppet] - https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872)
[10:26:27] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:25] \o/
[10:27:44] (CR) Muehlenhoff: [C: -1] "The package names are wrong; it's openjdk-8-dbg and openjdk-11-dbg" [puppet] - https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: Hashar)
[10:35:25] (PS2) Hashar: gerrit: install openjdk dbg package [puppet] - https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872)
[10:45:43] !log update puppet.eqsin.wmnet to point to puppetmaster1001
[10:45:46] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:46:39] Operations, Puppet, Traffic, serviceops: Puppet systemd::mask is an anti pattern that has unwanted side effect - https://phabricator.wikimedia.org/T233839 (ema) We are using `systemd::mask` and `systemd::unmask` to ensure that package installation does not trigger service startup (see for example...
[10:49:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:49:44] Operations, Puppet, Patch-For-Review: Rebuild puppet master backends - https://phabricator.wikimedia.org/T233915 (jbond) Open→Resolved
[10:49:46] Puppet, Patch-For-Review: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (jbond)
[10:50:39] Puppet: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (jbond)
[10:50:50] Puppet: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (jbond) p:Triage→Normal
[10:51:32] (PS1) Jbond: puppet: point puppet,eqsin.wmnet to puppetmaster1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/540096 (https://phabricator.wikimedia.org/T234315)
[10:52:07] (PS2) Jbond: puppet: point puppet,eqsin.wmnet to puppetmaster1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/540096 (https://phabricator.wikimedia.org/T234315)
[10:58:44] (CR) Muehlenhoff: [C: +1] "LGTM" [dns] -
https://gerrit.wikimedia.org/r/540096 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[10:59:53] (CR) Jbond: [C: +2] puppet: point puppet,eqsin.wmnet to puppetmaster1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/540096 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191001T1100).
[11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:00:40] o/
[11:00:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:02:56] (PS1) Jbond: pybal_config: switch all backends to eqiad while puppetmaster2001 is reimaged [puppet] - https://gerrit.wikimedia.org/r/540099 (https://phabricator.wikimedia.org/T2343151)
[11:07:10] (CR) Zfilipin: "Removing myself from reviewers since I'm not familiar with this."
[mediawiki-config] - https://gerrit.wikimedia.org/r/540083 (https://phabricator.wikimedia.org/T198559) (owner: Daniel Kinzler)
[11:13:23] (PS1) Jbond: puppet: point puppet.ulsfo.wmnet to puppetmaster1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/540100 (https://phabricator.wikimedia.org/T234315)
[11:14:41] (CR) Jbond: [C: +2] puppet: point puppet.ulsfo.wmnet to puppetmaster1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/540100 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[11:16:10] !log update puppet.ulsfo.wmnet to point to puppetmaster1001
[11:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:17] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:25:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:27:53] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:27:56] (PS1) Jbond: puppet: point puppet.codfw.wmnet to puppetmaster1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/540102 (https://phabricator.wikimedia.org/T234315)
[11:33:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:36:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:36:49] (PS1) Daimona Eaytoy: Enable AbuseFilterCachingParser on itwiki and nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/540103 (https://phabricator.wikimedia.org/T156095)
[11:37:42] (CR) jerkins-bot: [V: -1] Enable AbuseFilterCachingParser on itwiki and nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/540103 (https://phabricator.wikimedia.org/T156095) (owner: Daimona Eaytoy)
[11:41:32] (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/540102 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[11:41:55] (CR) Jbond: [C: +2] puppet: point puppet.codfw.wmnet to puppetmaster1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/540102
(https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[11:47:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:01:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:02:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:08:16] (PS1) Pmiazga: Set new MFMobileFormatterOptions config using old config [mediawiki-config] - https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690)
[12:09:20] (CR) jerkins-bot: [V: -1] Set new MFMobileFormatterOptions config using old config [mediawiki-config] - https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690) (owner: Pmiazga)
[12:12:36] Operations, Analytics, Fundraising-Backlog, SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (EYener) Thank you! I also have access to Turnilo. I have two follow-up questions: 1. Can @jkumalah and...
[12:23:57] (CR) Giuseppe Lavagetto: Document Apache gzip sidestepping (1 comment) [puppet] - https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: Gilles)
[12:26:25] PROBLEM - Check whether ferm is active by checking the default input chain on webperf1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:26:47] PROBLEM - DPKG on webperf1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:31:57] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[12:32:23] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[12:32:53] RECOVERY - Check whether ferm is active by checking the default input chain on webperf1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:33:07] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit
[12:33:11] ouch
[12:33:15] RECOVERY - DPKG on webperf1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:34:15] I can access cobalt via ssh
[12:34:20] gerrit seems up (the daemon)
[12:34:35] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.14-16-g855b179b5f (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[12:34:53] gerrit back
[12:35:01] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.345 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[12:35:13] hashar: did you do anything?
[12:35:22] the garbage collector is misbehaving :-\
[12:35:27] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26356 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[12:35:36] ah lovely
[12:36:15] ah gerrit uses X1, nice
[12:36:22] hashar: uhm, isn't git gc cronned?
[12:36:30] it is broken
[12:36:38] oh
[12:36:48] somehow it just keeps gcing stuff over and over
[12:36:51] which gerrit kindly refills
[12:36:58] well so I guess the manual git gc I ran a week ago for labs/tools/stewardbots wasn't bad heh
[12:37:05] so the CPUs are busy just gcing stuff :\
[12:37:06] git gc != java gc
[12:37:15] this is the java gc
[12:37:16] what is broken then?
[12:37:20] ah, java gc
[12:37:38] https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?panelId=14&fullscreen&orgId=1&from=now-1h&to=now is basically just above the -XX:MaxGCPauseMillis=300
[12:37:40] !log Gerrit misbehaved temporarily due to human operator error (hashar ran jstack -l -m which brings the JVM to a halt)
[12:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:48] elukey: or in short, that was my fault :-\
[12:38:10] I noticed we were getting some strange gc behavior and lots of load last week, but wasn't able to investigate
[12:38:24] the load abruptly stopped Friday evening, weirdly
[12:38:24] so, in short as well, git gc is okay but java gc is not?
[12:38:26] hashar: nah don't say that :)
[12:38:55] given the current status, should we restart gerrit?
[12:39:01] I mean the gc time
[12:39:10] I am still taking traces :|
[12:39:25] hoping for the mystery to finally show up
[12:39:54] sorry I didn't get it, ack, I'll shut up :)
[12:40:10] but yeah it should be restarted eventually
[12:41:14] threaddump: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMTAvMS8tLWFwaS0wNmUyYWFmMy0zMzU1LTQ4MjgtYWI0OC0xZjk3MjM1ZDUzYmZmNzRmMThhOC1iNjExLTQ2ZTAtOGQzOC1mNmY4NDAxM2RmZDAudHh0LS0=&
[12:41:57] gc report: https://api.gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAxOS8xMC8xLy0tYXBpLTA2ZTJhYWYzLTMzNTUtNDgyOC1hYjQ4LTFmOTcyMzVkNTNiZmUyN2NkYWIyLWQ5OWItNDRmYS05MzlkLTRlNzFiMGZmOGIzZC50eHQtLQ==&channel=API
[12:42:56] my current theory is that there may be some problematic traffic that causes a lot of churn that started last week
[12:43:27] whatever traffic stopped abruptly at 00:15ish last Friday: http://tyler.zone/gerrit-cpu-2019-09-27.png
[12:43:59] thcipriani: also lots of https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=log1ba80895aa45bb6826f2563d53a8abf66c92c093
[12:44:10] this is another missing repo
[12:45:27] hrm, yeah, may be related or may be a red herring
[12:51:11] I also started a full reindex for another bug
[12:52:09] (PS2) Jbond: pybal_config: switch all backends to eqiad while puppetmaster2001 is reimaged [puppet] - https://gerrit.wikimedia.org/r/540099 (https://phabricator.wikimedia.org/T2343151)
[12:52:10] we also have a bot that browses every single change
[12:52:15] which might be causing the issue
[12:53:41] (CR) Ema: [C: +1] pybal_config: switch all backends to eqiad while puppetmaster2001 is reimaged [puppet] - https://gerrit.wikimedia.org/r/540099 (https://phabricator.wikimedia.org/T2343151) (owner: Jbond)
[12:54:47] (CR) Muehlenhoff: "The fixed ferm package was updated on all Cloud VPS instances, you should be able to un-cherrypick this patch from deployment-prep now."
[puppet] - https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: Hashar)
[12:55:09] (CR) Jbond: [C: +2] pybal_config: switch all backends to eqiad while puppetmaster2001 is reimaged [puppet] - https://gerrit.wikimedia.org/r/540099 (https://phabricator.wikimedia.org/T2343151) (owner: Jbond)
[12:55:20] (PS3) Jbond: pybal_config: switch all backends to eqiad while puppetmaster2001 is reimaged [puppet] - https://gerrit.wikimedia.org/r/540099 (https://phabricator.wikimedia.org/T2343151)
[12:56:04] (PS2) Daimona Eaytoy: Enable AbuseFilterCachingParser on itwiki and nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/540103 (https://phabricator.wikimedia.org/T156095)
[12:56:15] (PS3) MarcoAurelio: gerrit: Fix renamed group name "Project and Group Creators" [puppet] - https://gerrit.wikimedia.org/r/539676
[12:57:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:57:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:58:21] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[12:58:26] <_joe_> uh oh
[12:58:41] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[12:58:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:59:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:59:03] <_joe_> nothing at the applayer level I can see right now
[12:59:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:59:09] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[12:59:21] <_joe_> oh there it is
[12:59:26] <_joe_> a spike on the API
[12:59:37] <_joe_> just recovered
[12:59:57] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[12:59:59] <_joe_> it's probably the same as the 100s of such cases we've seen over time
[13:00:17] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[13:00:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[13:00:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[13:00:45] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[13:01:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[13:07:48] (PS1) Muehlenhoff: Add repo sync for buster/grafana [puppet] - https://gerrit.wikimedia.org/r/540113
[13:22:00] (PS1) Thcipriani: gerrit: G1GC tuning. Increase NewGen space [puppet] - https://gerrit.wikimedia.org/r/540114
[13:22:42] (CR) jerkins-bot: [V: -1] gerrit: G1GC tuning. Increase NewGen space [puppet] - https://gerrit.wikimedia.org/r/540114 (owner: Thcipriani)
[13:24:25] !log reimage puppetmaster2001
[13:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:28] (PS2) Thcipriani: gerrit: G1GC tuning. Increase NewGen space [puppet] - https://gerrit.wikimedia.org/r/540114
[13:33:25] (CR) Hashar: [C: +1] "That should help with the Eden space that keeps being filled by Gerrit just to be immediately garbage collected :-\" [puppet] - https://gerrit.wikimedia.org/r/540114 (owner: Thcipriani)
[13:40:15] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:34] (CR) CDanis: [C: +2] dbctl schema: disallow 's3' value when appropriate [software/conftool] - https://gerrit.wikimedia.org/r/539859 (https://phabricator.wikimedia.org/T233679) (owner: CDanis)
[13:46:02] (PS1) 20after4: Fix phatality deployment script [puppet] - https://gerrit.wikimedia.org/r/540117
[13:48:00] *ph*abricator, *ph*atality, mani*ph*est -- Phacility is such fun :)
[13:48:24] (Merged) jenkins-bot: dbctl schema: disallow 's3' value when appropriate [software/conftool] - https://gerrit.wikimedia.org/r/539859 (https://phabricator.wikimedia.org/T233679) (owner: CDanis)
[13:48:39] I suppose you meant phun :-)
[13:52:39] (PS1) CDanis: dbctl schema: apply Ic60e40d2 in production [puppet] - https://gerrit.wikimedia.org/r/540119 (https://phabricator.wikimedia.org/T233679)
[13:52:46] moritzm: ha ha, true that
[13:54:23] (CR) CDanis: [C: +2] dbctl schema: apply Ic60e40d2 in production [puppet] - https://gerrit.wikimedia.org/r/540119 (https://phabricator.wikimedia.org/T233679) (owner: CDanis)
[13:55:43] !sal
[13:55:43] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need.
[13:59:05] !log ✔️ cdanis@puppetmaster2001.codfw.wmnet ~ 🕙☕ sudo touch /srv/config-master/puppet-sha1.txt /srv/config-master/labsprivate-sha1.txt && sudo chown gitpuppet:gitpuppet /srv/config-master/puppet-sha1.txt /srv/config-master/labsprivate-sha1.txt
[13:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:14] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:01:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:43] !log rebooting mw1265 for some tests
[14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:38] !log beginning rolling reboots of eqiad and codfw logstash collectors
[14:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:44] !log ✔️ cdanis@puppetmaster2001.codfw.wmnet ~ 🕙☕ (cd /var/lib/git/operations/puppet ; git rev-parse HEAD | sudo tee /srv/config-master/puppet-sha1.txt )
[14:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:47] !log ✔️ cdanis@puppetmaster2001.codfw.wmnet ~ 🕙☕ (cd /var/lib/git/labs/private ; git rev-parse HEAD | sudo tee /srv/config-master/labsprivate-sha1.txt )
[14:08:50] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:34] !log Restarting CI Jenkins
[14:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:24] hi Daimona
[14:11:31] Sup
[14:12:03] Daimona: I was wondering if you could access Logstash and see what may be causing T234326 ?
[14:12:04] T234326: nqo.wikipedia: "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." - https://phabricator.wikimedia.org/T234326
[14:12:20] (CR) Paladox: [C: +1] gerrit: G1GC tuning. Increase NewGen space [puppet] - https://gerrit.wikimedia.org/r/540114 (owner: Thcipriani)
[14:12:36] Hm I'm unsure, but let me try
[14:12:42] (CR) Paladox: [C: +1] "> > This is a safe change, all it does is changes the default group" [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[14:14:42] I see something
[14:17:20] ^Posted on phab, no further clues tho
[14:17:28] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:17:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:02] !log rebooting krb1001 for some tests
[14:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:15] thanks
[14:20:07] (CR) Thcipriani: [C: +1] "The currently listed group no longer exists, this update is definitely needed. Thanks for spotting it!"
[puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[14:22:39] (PS2) Pmiazga: Set new MFMobileFormatterOptions config using old config [mediawiki-config] - https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690)
[14:24:51] Daimona: maybe something with addWiki.php being still broken and CirrusSearch
[14:25:03] (PS1) Pmiazga: Remove old and unused MFMobileFormatterHeadings config [mediawiki-config] - https://gerrit.wikimedia.org/r/540123 (https://phabricator.wikimedia.org/T232690)
[14:25:12] Operations, Gerrit, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar) Sorry indeed, I have been confused by all those plugins. Turns out today I lacked some m...
[14:25:13] Was the wiki newly created?
[14:26:46] Operations, Release Pipeline, serviceops, CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (akosiaris) >>! In T223953#5535332, @mobrovac wrote: > @akosiaris regarding rate limiting,...
[14:26:51] Puppet: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (jbond)
[14:26:56] Daimona: sì
[14:27:02] Puppet: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (jbond)
[14:27:22] Hah :)
[14:27:57] I'm unsure how this works, but maybe the CirrusSearch backend still has to be configured or sth
[14:28:34] afaics search is handled by addWiki.php but that script lately is a generator of tons of "fun"
[14:28:50] you fix something and something else breaks
[14:28:56] hauskatze: do you know who created the wiki and if by chance the logs of addWiki are still available?
[14:29:12] dcausse: I think Amir1 did it
[14:29:22] Operations, ops-codfw, DC-Ops, decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/16; [edit interfaces interface-range disabled] me...
[14:29:40] Puppet: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (jbond) /srv/config-master/puppet-sha1.txt should not be writable by gitpuppet
[14:29:59] dcausse: https://phabricator.wikimedia.org/T230359#5523436
[14:30:46] dcausse: yeah I did it, no error showed up in the maintenance script run
[14:31:04] gehel: re: logstash shards, thanks will take a look shortly!
[14:31:45] Amir1: thanks, I'll dig into it, cirrus has never played very well with addWiki :(
[14:31:47] PROBLEM - Host db2068 is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:19] dcausse: shall we check all newly created wikis to see if the same problem exists?
[14:32:24] Nothing plays very well with addwiki [14:32:29] ^^ [14:32:35] RECOVERY - Host db2068 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [14:32:43] hauskatze: yes I think so [14:32:45] addWiki seems like a pile of hacks that for some reason works [14:32:49] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) [14:32:49] db2068 that was me disable the wrong port i have it back [14:32:51] -ish [14:33:50] (03PS1) 10CDanis: config-master: ensure perms on puppet-merge sha files [puppet] - 10https://gerrit.wikimedia.org/r/540126 (https://phabricator.wikimedia.org/T234332) [14:33:59] !log created cirrussearch indices for nqowiki (T234326) [14:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:04] T234326: nqo.wikipedia: "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." 
- https://phabricator.wikimedia.org/T234326 [14:34:11] dcausse: okay, there are not much, I could take a look [14:35:22] hauskatze: if you track newly created wikis please feel free to add me as subscriber [14:35:27] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/540126 (https://phabricator.wikimedia.org/T234332) (owner: 10CDanis) [14:35:37] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:36:17] dcausse: sure, we track new wiki creations in #wiki-setup-create [14:36:20] dcausse: The one before it is hiwikisource [14:36:28] ok checking [14:36:39] the one before it is napwikisource [14:36:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:37:11] hiwikisource doesn't return any error while searching [14:37:34] (03PS1) 10Jbond: puppetmaster2001: move eqsin back to puppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/540130 (https://phabricator.wikimedia.org/T234315) [14:37:36] (03PS1) 10Alexandros Kosiaris: restrouter: Add ratelimiting support to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/540131 (https://phabricator.wikimedia.org/T223953) [14:37:44] napwikisource seems okay as well [14:37:49] prima facie [14:38:27] yes both of them have their indices setup in elastic [14:38:42] (03CR) 10Alexandros Kosiaris: "question: What is the ratelimiting key? I am assuming in production we want it to be some HTTP header like X-Client-IP ?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/540131 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:40:04] List of wikis created on 2019: https://incubator.wikimedia.org/wiki/Incubator:Site_creation_log#2019 [14:41:41] (03PS1) 10Jbond: puppetmaster2001: move ulsfo back to puppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/540132 [14:41:43] (03PS1) 10Jbond: puppetmaster2001: move codfw back to puppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/540133 (https://phabricator.wikimedia.org/T234315) [14:41:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/540130 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [14:42:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/540132 (owner: 10Jbond) [14:42:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/540133 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [14:42:49] (03CR) 10Jbond: [C: 03+2] puppetmaster2001: move eqsin back to puppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/540130 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [14:43:43] hywwiki seems okay as well dcausse [14:44:04] hauskatze: thanks for checking and the link (very useful) [14:44:28] note that only content wikis are listed there... 
Chapters and other wikis are not [14:44:46] I guess monitoring all.dblist would be more accurate [14:44:57] (03CR) 10CDanis: [C: 03+2] config-master: ensure perms on puppet-merge sha files [puppet] - 10https://gerrit.wikimedia.org/r/540126 (https://phabricator.wikimedia.org/T234332) (owner: 10CDanis) [14:50:18] 10Operations, 10netops: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down - https://phabricator.wikimedia.org/T234335 (10CDanis) [14:50:54] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis https://phabricator.wikimedia.org/T234335 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:50:54] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis https://phabricator.wikimedia.org/T234335 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:51:16] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) I do not have any idea from where to start though there are a few leads in the previous... 
[14:51:29] (03CR) 10Jbond: [C: 03+2] puppetmaster2001: move ulsfo back to puppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/540132 (owner: 10Jbond) [14:54:00] 10Puppet, 10Patch-For-Review: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10CDanis) 05Open→03Resolved [14:56:32] 10Puppet, 10Patch-For-Review: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) 05Resolved→03Open thanks chris but i still need to explore /var/lib/puppet/server/ssl/certs/ca.pem [14:56:35] 10Operations, 10ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10RobH) 05Resolved→03Open @Jclark-ctr: Please note this task was not ready to be resolved, it has many, many steps left. [14:56:37] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) [14:57:49] (03CR) 10Mobrovac: [C: 03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/540131 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:58:55] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:59:17] (03CR) 10CDanis: [C: 03+1] "Thanks! Had meant to do this but didn't get around to it yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/540113 (owner: 10Muehlenhoff) [14:59:57] (03CR) 10Jbond: [C: 03+2] puppetmaster2001: move codfw back to puppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/540133 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:02:06] (03CR) 10Alexandros Kosiaris: "> The rate-limiting key is based on the x-client-ip header and the end point's generic name." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/540131 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:05:14] (03PS1) 10Jhedden: wikimediacloud.org: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/540148 (https://phabricator.wikimedia.org/T223907) [15:09:27] (03PS1) 10Ottomata: Add missing an-worker1088 to hadoop net_topology [puppet] - 10https://gerrit.wikimedia.org/r/540149 (https://phabricator.wikimedia.org/T209929) [15:09:31] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:11:06] (03CR) 10Ayounsi: [C: 03+2] Netbox Juniper installed base report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/539192 (owner: 10Ayounsi) [15:12:47] (03CR) 10Elukey: [C: 03+1] Add missing an-worker1088 to hadoop net_topology [puppet] - 10https://gerrit.wikimedia.org/r/540149 (https://phabricator.wikimedia.org/T209929) (owner: 10Ottomata) [15:12:52] !log puppetmaster2001 is back online [15:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:11] (03PS1) 10Elukey: profile::hadoop::master: fix check_hdfs_topology script [puppet] - 10https://gerrit.wikimedia.org/r/540150 [15:14:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. I don't know if we need anything else in this repo for the new domain to work, like external triggers, hooks or whatever." 
[dns] - 10https://gerrit.wikimedia.org/r/540148 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:14:58] (03PS2) 10Elukey: profile::hadoop::master: fix check_hdfs_topology script [puppet] - 10https://gerrit.wikimedia.org/r/540150 [15:21:41] (03PS1) 10Jbond: pybal_config: add codfw backend back [puppet] - 10https://gerrit.wikimedia.org/r/540151 [15:22:02] (03CR) 10Ayounsi: "Added the manual steps to https://wikitech.wikimedia.org/wiki/Netbox#Juniper_Report" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/539192 (owner: 10Ayounsi) [15:25:45] (03CR) 10Ottomata: [C: 03+1] profile::hadoop::master: fix check_hdfs_topology script [puppet] - 10https://gerrit.wikimedia.org/r/540150 (owner: 10Elukey) [15:27:11] 10Operations, 10netops, 10Wikimedia-Incident: asw2-d2-eqiad crash - https://phabricator.wikimedia.org/T233645 (10ayounsi) 05Open→03Resolved Discussed during the Monday meeting, will leave it as it. [15:29:49] (03CR) 10Jbond: [C: 03+2] pybal_config: add codfw backend back [puppet] - 10https://gerrit.wikimedia.org/r/540151 (owner: 10Jbond) [15:30:45] (03PS10) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) [15:33:01] (03CR) 10Ayounsi: [C: 03+2] Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [15:33:23] (03PS1) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [15:35:45] (03PS11) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) [15:36:08] !log powercycle an-conf1001 to test some bios settings [15:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:37] <_joe_> !log uninstalling temporarily the 
math rendering related packages from mwdebug2002, test for T195847 [15:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:41] T195847: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 [15:38:23] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:17] (03PS1) 10Giuseppe Lavagetto: mediawiki::packages: remove packages for math rendering [puppet] - 10https://gerrit.wikimedia.org/r/540154 (https://phabricator.wikimedia.org/T195847) [15:41:07] PROBLEM - Host ms-be2055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:43:47] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) disable switch port on asw-d-codfw for ms-be2055 since we are moving the server in row C ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vla... [15:44:41] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:46:44] (03PS1) 10Alexandros Kosiaris: ores: Alert on aberrant amount of celery workers [puppet] - 10https://gerrit.wikimedia.org/r/540155 (https://phabricator.wikimedia.org/T230917) [15:47:21] (03CR) 10Mobrovac: [C: 03+1] "> Fine by me. Is the x-client-ip something we might want to have" [deployment-charts] - 10https://gerrit.wikimedia.org/r/540131 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:48:51] (03PS1) 10Ayounsi: PDU password cookbook, fix typos [cookbooks] - 10https://gerrit.wikimedia.org/r/540156 (https://phabricator.wikimedia.org/T233053) [15:49:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ores: Alert on aberrant amount of celery workers [puppet] - 10https://gerrit.wikimedia.org/r/540155 (https://phabricator.wikimedia.org/T230917) (owner: 10Alexandros Kosiaris) [15:49:29] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:13] <_joe_> jbond42: can I merge your change too? 
[15:51:57] (03CR) 10Ayounsi: [C: 03+2] PDU password cookbook, fix typos [cookbooks] - 10https://gerrit.wikimedia.org/r/540156 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [15:52:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Right now it's hard-coded in the filter code, but we can easily make it configurable if needed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/540131 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:52:43] (03Merged) 10jenkins-bot: restrouter: Add ratelimiting support to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/540131 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:53:51] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:56:27] (03PS1) 10Ayounsi: PDU password cookbook, fix more typos [cookbooks] - 10https://gerrit.wikimedia.org/r/540157 (https://phabricator.wikimedia.org/T233053) [15:56:46] (03PS1) 10Alexandros Kosiaris: Bump restrouter chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/540158 [15:58:23] 10Operations, 10Mail: Vendor's Emails Not Coming Through - https://phabricator.wikimedia.org/T233991 (10HMarcus) 05Open→03Resolved a:03HMarcus Thanks for confirming that @herron . will go ahead and close this out as we've found these emails in our quarantine. [15:58:48] (03CR) 10Ayounsi: [C: 03+2] PDU password cookbook, fix more typos [cookbooks] - 10https://gerrit.wikimedia.org/r/540157 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [15:59:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump restrouter chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/540158 (owner: 10Alexandros Kosiaris) [15:59:26] (03Merged) 10jenkins-bot: Bump restrouter chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/540158 (owner: 10Alexandros Kosiaris) [16:00:04] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). 
Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191001T1600). [16:00:04] Lucas_WMDE: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:26] 10Operations, 10ORES, 10serviceops, 10Patch-For-Review, 10Scoring-platform-team (Current): celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Joe) First of all, apologies for losing track of this task. What you... [16:00:40] <_joe_> Lucas_WMDE: let me look at those patches [16:01:01] o/ [16:01:06] <_joe_> uh it's wmcs stuff, I don't know anything about it [16:01:21] bstorm_ already merged one [16:01:28] though I don’t see the manpage on toolforge yet [16:01:31] <_joe_> did you get a +1 from them? [16:02:00] (03PS1) 10Elukey: profile::kerberos::replication: add AAAA ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/540161 [16:02:07] <_joe_> I mean the change seems reasonable but this is quite out of the scope of my knowledge sorry [16:02:25] which one do you mean? 
[16:02:34] “fix variable” had a +1 before I rebased it [16:02:37] <_joe_> all patches to dologmsg :) [16:02:48] well the other patch to dologmsg was already merged, as I said :) [16:02:51] <_joe_> yeah ok [16:02:53] so it even got a +2 [16:02:54] <_joe_> ahah ok [16:02:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] dologmsg: fix variable [puppet] - 10https://gerrit.wikimedia.org/r/511750 (owner: 10Lucas Werkmeister (WMDE)) [16:03:05] “exec watch” has no review yet [16:03:44] <_joe_> ok so this one was quite uncontroversial [16:04:27] (03PS5) 10Giuseppe Lavagetto: dologmsg: fix variable [puppet] - 10https://gerrit.wikimedia.org/r/511750 (owner: 10Lucas Werkmeister (WMDE)) [16:04:35] <_joe_> for the fatalmonitor script, let me refresh my memory a second [16:04:46] <_joe_> but I think it's been rendered useless in its current form [16:05:33] <_joe_> yes, this watches the HHVM logs, which should be gone in a couple weeks :) [16:05:45] oh, ok [16:05:51] then I’ll need to update my SWAT script [16:06:21] <_joe_> sorry, but scap already monitors logstash [16:06:42] <_joe_> I have to understand how this is used, but lemme merge the other change first [16:06:54] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [16:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:59] <_joe_> if jenkins allows me [16:08:27] (03CR) 10Lucas Werkmeister (WMDE): "Hm, I don’t see the manpage on Toolforge yet…" [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [16:09:38] <_joe_> Lucas_WMDE: so, what is this script used for? [16:09:52] fatalmonitor or dologmsg? 
[16:09:58] <_joe_> fatalmonitor [16:10:09] <_joe_> dologmsg I am aware more or less :) [16:10:23] shows the most common errors in the HHVM logs [16:10:24] <_joe_> so I understand what fatalmonitor does [16:10:29] auto-refreshes every 2 seconds, I think [16:10:35] <_joe_> not sure what it's used for [16:10:42] it’s one of the terminal tabs SWAT deployers are supposed to have open https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Terminal_tabs [16:10:50] to notice if anything’s wrong with a deployment [16:10:58] 10Operations: Add Urbanecm to #mediawiki_security - https://phabricator.wikimedia.org/T233235 (10herron) 05Open→03Resolved a:03herron done! [16:10:59] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' . [16:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:08] <_joe_> Lucas_WMDE: but this counts logs on a local machine [16:11:22] on a special machine where all the logs are (were?) forwarded [16:11:24] I think [16:11:28] <_joe_> so, given scap does monitor logstash [16:11:40] <_joe_> oh eek [16:11:50] <_joe_> hhvm.log there, lemme see what you should use [16:12:08] <_joe_> Lucas_WMDE: anyways, can we continue this tomorrow? It needs some investigating on my part [16:12:31] sure [16:12:46] <_joe_> hhvm.log doesn't exist anymore on mwlog1001 :P [16:13:25] should I create a “figure out what to do with fatalmonitor script” task? [16:13:39] (I should probably also check the ops mailing list first, if I missed anything) [16:13:41] <_joe_> I guess so [16:14:07] <_joe_> no you did not, sorry, I didn't send out an announcement there not thinking of such issues. [16:14:19] any idea how long hhvm.log has been gone already? [16:14:27] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'restrouter' for release 'production' . 
[16:14:27] <_joe_> a few days I think [16:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:29] <_joe_> let me check [16:14:35] because my fellow SWAT deployers are supposed to monitor the command, you’d think they would have noticed it’s gone 🤔 [16:14:35] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'restrouter' for release 'production' . [16:14:36] <_joe_> but should be ~ 1 week [16:14:37] ;) [16:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:46] <_joe_> Lucas_WMDE: you would think! [16:15:54] (03CR) 10Jforrester: [C: 03+2] Labs: set MCR migration stage to SCHEMA_COMPAT_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540083 (https://phabricator.wikimedia.org/T198559) (owner: 10Daniel Kinzler) [16:16:55] (03Merged) 10jenkins-bot: Labs: set MCR migration stage to SCHEMA_COMPAT_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540083 (https://phabricator.wikimedia.org/T198559) (owner: 10Daniel Kinzler) [16:16:58] <_joe_> !log manually downgrading php-geoip on deploy*, it was still at the 7.0-only version from the distro [16:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:26] <_joe_> Lucas_WMDE: that file is gone since 1 week, yes [16:17:54] (03PS1) 10Ayounsi: PDU password cookbook, dedup puppetdb list [cookbooks] - 10https://gerrit.wikimedia.org/r/540162 (https://phabricator.wikimedia.org/T233053) [16:18:15] 10Operations, 10ORES, 10serviceops, 10Patch-For-Review, 10Scoring-platform-team (Current): celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) Thanks @Joe and no worries. I'm happy to move this one off... [16:18:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This script in its current form is useless, we need to figure out what to do with it." 
[puppet] - 10https://gerrit.wikimedia.org/r/499761 (owner: 10Lucas Werkmeister (WMDE)) [16:19:00] <_joe_> Lucas_WMDE: ping me tomorrow morning, we can figure out what to do. My current feeling is this script should just be killed [16:19:47] ok thanks [16:22:38] _joe_: created https://phabricator.wikimedia.org/T234345 [16:23:38] (03CR) 10Lucas Werkmeister (WMDE): "> we need to figure out what to do with it." [puppet] - 10https://gerrit.wikimedia.org/r/499761 (owner: 10Lucas Werkmeister (WMDE)) [16:24:16] (03CR) 10CRusnov: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/540162 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [16:25:27] (03CR) 10Ayounsi: [C: 03+2] PDU password cookbook, dedup puppetdb list [cookbooks] - 10https://gerrit.wikimedia.org/r/540162 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [16:25:30] (03PS1) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 [16:26:05] (03PS2) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 [16:26:17] (03PS9) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [16:26:27] another gerrit host? 
:) [16:26:28] (03PS3) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 [16:26:42] (03CR) 10CRusnov: [C: 03+2] Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov) [16:26:52] (03PS4) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 [16:27:22] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [16:29:33] (03PS10) 10CRusnov: Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) [16:36:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:36:37] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:36:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor inline nitpick, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi) [16:37:22] (03PS1) 10CRusnov: interface_automation: fix refactor error [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540166 [16:39:02] (03CR) 10Ayounsi: [C: 03+1] interface_automation: fix refactor error [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540166 (owner: 10CRusnov) [16:39:10] (03CR) 10CRusnov: [C: 03+2] interface_automation: fix refactor error [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540166 (owner: 10CRusnov) [16:41:07] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 
208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:41:13] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:42:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10greg) >>! In T233202#5506870, @herron wrote: > @greg could you please review/approve this request for deployment permissions? Sorry for the delay (my phabri... [16:43:46] (03PS1) 10CRusnov: interface_automation: fix another refactor error [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540167 [16:49:34] RECOVERY - Host ms-be2055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [16:51:58] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) switch port configuration for ms-be2055 in row C ` apaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] member xe-7/0/10... 
[16:52:01] (03PS3) 10Krinkle: Enable AbuseFilterCachingParser on itwiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540103 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [16:52:04] (03CR) 10Krinkle: [C: 03+1] Enable AbuseFilterCachingParser on itwiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540103 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [16:53:14] * Krinkle staging on mwdebug1002 [16:53:30] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) a:05Papaul→03fgiunchedi [16:54:41] (03CR) 10Krinkle: [C: 03+2] Enable AbuseFilterCachingParser on itwiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540103 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [16:55:35] (03Merged) 10jenkins-bot: Enable AbuseFilterCachingParser on itwiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540103 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [16:59:04] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10Cmjohnson) [16:59:37] Daimona: OK, staging now [16:59:39] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) [16:59:58] Good. mwdebug1002? [17:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191001T1700). [17:00:13] Daimona: yep, as of now. 
[17:00:17] no parsoid deploy today [17:01:18] Nothing weird [17:01:26] So I think we can go on [17:02:21] doing some more testing on nlwiki and a XHGui profile to see the new class used [17:03:46] (03CR) 10Dzahn: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/539997 (https://phabricator.wikimedia.org/T233202) (owner: 10Dzahn) [17:04:08] (03PS3) 10Dzahn: admins: create deployment shell user for Andrew Kostka [puppet] - 10https://gerrit.wikimedia.org/r/539997 (https://phabricator.wikimedia.org/T233202) [17:04:26] Good [17:04:34] Hm well I'm not seeing any parser [17:04:42] I guess we still haven't solved that issue [17:04:45] Let me try [17:04:51] I managed to see it on mw.o yesterday [17:04:56] it falls between the capture points of xhprof [17:05:01] yeah, maybe without stashing [17:05:06] (03PS2) 10CRusnov: interface_automation: fix minor issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540167 [17:05:09] let me disable JS , you try as well :D [17:06:20] Sure [17:06:27] AbuseFilterCachingParser::intEval [17:06:29] cool [17:07:05] and metrics are coming in as well for our edits, I think. [17:07:07] !log cutting wmf/1.34.0-wmf.25 [17:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:22] Yes, seeing it [17:07:30] !log Welcome new deployer Andrew Kostka (WMDE) (T233202) [17:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:45] marxarelli: I'm about to finish a sync-file btw [17:07:47] 2min left [17:08:01] Krinkle: no prob. thx [17:08:05] mutante \o/ [17:09:02] Daimona: ok logstash all clear as well, just the usual warnings about unset vars in some filters. [17:09:06] Lucas_WMDE: :) if you see him you can say it should work after next puppet run on everything. ~ 30 min. he can ssh to deploy1001 .. 
now [17:09:07] syncing :) [17:09:11] Indeed [17:09:28] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T156095 - c28baa1862401 (duration: 00m 59s) [17:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:32] T156095: Re-enable AbuseFilterCachingParser once we are sure it's safe - https://phabricator.wikimedia.org/T156095 [17:09:37] Lucas_WMDE: maybe you can do a bit of the "training" part per https://phabricator.wikimedia.org/T233202#5538271 ? [17:10:02] sure [17:10:07] cool [17:11:48] Krinkle: I see data flowing. So we're done for now? :) [17:13:07] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Dzahn) thanks Greg for approval and Moritz for code review. merged and deployed 13:07 < mutante> !log Welcome new deployer Andrew Kostka (WMDE) (T233202) @Andrew-WMDE You shoul... [17:13:20] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Lucas_Werkmeister_WMDE) > Please reach out to fellow WMDE deployers for a training or if not possible, RelEng team members. I’d be happy to help out :) do you have any planned/up... [17:13:45] Urbanecm: Not sure if anyone is on sitereq-l yet, but I sent https://groups.google.com/a/wikimedia.org/forum/#!forum/sitereq-l there… [17:13:54] Err, https://groups.google.com/a/wikimedia.org/forum/#!topic/sitereq-l/NX3BpY2Nqeo even. [17:14:09] James_F: received. I hope people will join soon, thanks! [17:14:30] OK, that's a start at least. 
:-) [17:14:35] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Dzahn) [17:14:40] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [17:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:14] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Dzahn) @Lucas_Werkmeister_WMDE Maybe shoulder surfing during your next SWAT? I see you are on a couple on the calendar. [17:16:55] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Dzahn) 05Open→03Resolved a:03Dzahn [17:17:08] Daimona: yep, all good. [17:17:14] Daimona: I'm notifying nlwiki village pump [17:17:40] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP to Kevinbazira - https://phabricator.wikimedia.org/T234209 (10Dzahn) [17:18:13] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP to Kevinbazira - https://phabricator.wikimedia.org/T234209 (10Dzahn) adding SRE-Access-Requests because it's more than just LDAP. Involves also shell access. [17:18:23] Great, thanks! [17:18:33] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:45] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10MusikAnimal) There have been more reports of this. Going off of the error logs for #xtools, the last burst of 503s from the MediaWiki API happened from around 11:00 to 12:00 UTC on October 1. 
[17:19:53] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevinbazira - https://phabricator.wikimedia.org/T234209 (10Dzahn) [17:20:03] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Dzahn) [17:20:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Dzahn) p:05Triage→03Normal [17:21:14] RECOVERY - Host ms-be2055 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [17:21:36] (03CR) 10Bstorm: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [17:21:56] (03PS4) 10Dzahn: gerrit: Fix renamed group name "Project and Group Creators" [puppet] - 10https://gerrit.wikimedia.org/r/539676 (owner: 10MarcoAurelio) [17:22:59] (03CR) 10Dzahn: [C: 03+2] gerrit: Fix renamed group name "Project and Group Creators" [puppet] - 10https://gerrit.wikimedia.org/r/539676 (owner: 10MarcoAurelio) [17:23:20] (03CR) 10Jdlrobson: [C: 03+1] Set new MFMobileFormatterOptions config using old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540107 (https://phabricator.wikimedia.org/T232690) (owner: 10Pmiazga) [17:31:03] (03PS3) 10Dzahn: gerrit: G1GC tuning. Increase NewGen space [puppet] - 10https://gerrit.wikimedia.org/r/540114 (owner: 10Thcipriani) [17:32:15] (03CR) 10Dzahn: [C: 03+2] gerrit: G1GC tuning. 
Increase NewGen space [puppet] - 10https://gerrit.wikimedia.org/r/540114 (owner: 10Thcipriani) [17:33:41] 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10RobH) [17:34:12] (03PS16) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [17:36:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 57.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:39:12] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 78.12 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:41:25] (03CR) 10Jdlrobson: [C: 03+1] "but must be deployed Thursday/late Wednesday after the train has rolled." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540123 (https://phabricator.wikimedia.org/T232690) (owner: 10Pmiazga) [17:42:35] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis https://phabricator.wikimedia.org/T234335 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:42:35] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis https://phabricator.wikimedia.org/T234335 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:40] 10Operations, 10Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (10eliza) Hello All, Was wondering what may be the status on this request? 
Eliza [17:48:46] !log rotate PDUs passwords - T233053 [17:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:57] !log ayounsi@cumin1001 START - Cookbook sre.hosts.rotate-pdu-password [17:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:06] :)) [17:50:17] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.rotate-pdu-password (exit_code=97) [17:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:33] !log ayounsi@cumin1001 START - Cookbook sre.hosts.rotate-pdu-password [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:59] !log gerrit restart for new config changes incoming [17:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:49] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.rotate-pdu-password (exit_code=97) [17:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:52] (03Abandoned) 10Hashar: gerrit: remove a no more existing group [puppet] - 10https://gerrit.wikimedia.org/r/525048 (owner: 10Hashar) [17:54:01] well seems like something is not working [17:55:37] * godog compulsively hits reload on gerrit [17:56:18] thanks godog [17:56:23] PROBLEM - Check the last execution of git_pull_charts on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:56:28] should be back now [17:56:36] :) [17:56:55] thcipriani: lol, it is indeed back now, thanks! [17:57:23] saw your message re: restart so no panic ensued [17:57:46] godog: heh, that's good. 
I'm just glad to know I'm not alone: tailing the logs, refreshing repeatedly, humming crazily to myself [17:58:45] marxarelli: ^ you're all clear for train stuffs, sorry for the interruption, and thanks for letting me steal the conch [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191001T1800) [18:01:50] (03PS2) 10Filippo Giunchedi: logstash: parse nested json from mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) [18:02:30] thcipriani: yup, my first reaction before looking at irc was like https://external-preview.redd.it/WZHj5iCqD6ktDs9EsrI8UNvcViIaAV9gkQ4EL_rgsPk.gif?format=mp4&s=9e751053acfc79f0e37402237a5f64211c6f13af [18:03:11] (03CR) 10Filippo Giunchedi: logstash: parse nested json from mmkubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi) [18:03:22] (03PS1) 10Bstorm: dologmsg: ensure the directory exists before trying to add the man page [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) [18:03:56] godog: hahaha nice [18:04:56] (03PS1) 10Herron: WIP: elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 [18:06:01] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:09] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:51] RECOVERY - Check the last execution of git_pull_charts on deploy2001 is OK: OK: Status of the systemd unit git_pull_charts 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:07:24] (03CR) 10jerkins-bot: [V: 04-1] WIP: elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 (owner: 10Herron) [18:09:47] (03PS2) 10Herron: WIP: elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 [18:10:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:11:08] (03CR) 10Aaron Schulz: [C: 03+2] Configure allow_tcp_nagle_delay for mcrouter cache in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539985 (owner: 10Aaron Schulz) [18:11:36] (03Merged) 10jenkins-bot: Configure allow_tcp_nagle_delay for mcrouter cache in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539985 (owner: 10Aaron Schulz) [18:12:11] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:12:55] (03CR) 10Bstorm: "Puppet is broken on the toolforge bastions, so running the compiler (see https://puppet-compiler.wmflabs.org/compiler1002/18698/tools-sgeb" [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) (owner: 10Bstorm) [18:13:15] (03PS2) 10Bstorm: dologmsg: ensure the directory exists before trying to add the man page [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) [18:13:17] (03CR) 10jerkins-bot: [V: 04-1] WIP: elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 (owner: 10Herron) [18:13:26] (03CR) 10Herron: 
"https://puppet-compiler.wmflabs.org/compiler1002/18699/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/540189 (owner: 10Herron) [18:14:18] (03PS3) 10Herron: elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) [18:16:40] (03PS1) 10Dzahn: gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) [18:16:42] (03CR) 10jerkins-bot: [V: 04-1] dologmsg: ensure the directory exists before trying to add the man page [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) (owner: 10Bstorm) [18:17:49] (03CR) 10Dzahn: [C: 03+1] "first https://gerrit.wikimedia.org/r/c/operations/puppet/+/540190 now" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [18:17:56] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add toggle for rsyslog udp logback compat include [puppet] - 10https://gerrit.wikimedia.org/r/540189 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [18:18:14] (03CR) 10Paladox: [C: 03+1] gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:18:21] thcipriani ^^ :) [18:18:48] (03PS2) 10Dzahn: gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) [18:18:59] (03PS1) 10Dduvall: Group0 to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540191 [18:19:07] (03PS2) 10Filippo Giunchedi: swift: open per-port object server ports [puppet] - 10https://gerrit.wikimedia.org/r/539535 (https://phabricator.wikimedia.org/T162123) [18:19:24] (03CR) 10Bstorm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/540188 
(https://phabricator.wikimedia.org/T222244) (owner: 10Bstorm) [18:21:47] (03CR) 10jerkins-bot: [V: 04-1] gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:22:04] AaronSchulz: are you deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/539985 ? [18:22:44] just saw it show up in origin/master while prepping for train [18:22:44] (03CR) 10jerkins-bot: [V: 04-1] dologmsg: ensure the directory exists before trying to add the man page [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) (owner: 10Bstorm) [18:22:57] (03CR) 10Jforrester: [C: 03+1] "Today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 (owner: 10Jforrester) [18:23:09] (03CR) 10Jforrester: [C: 03+2] Add the beta REL1_34 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 (owner: 10Jforrester) [18:23:33] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:39] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:24:25] (03CR) 10Jforrester: [C: 03+1] "Actually, let's wait." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 (owner: 10Jforrester) [18:24:30] (03PS2) 10Jforrester: Add the beta REL1_34 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 [18:24:57] (03PS3) 10Paladox: gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:25:53] (03CR) 10Bstorm: [V: 03+2 C: 03+2] "Jerkins seems to be broken? 
It worked before rebasing. I'm going to override and merge." [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) (owner: 10Bstorm) [18:26:06] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: open per-port object server ports [puppet] - 10https://gerrit.wikimedia.org/r/539535 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [18:26:30] (03PS3) 10Bstorm: dologmsg: ensure the directory exists before trying to add the man page [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) [18:27:37] (03CR) 10Thcipriani: [C: 03+1] gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:27:42] (03CR) 10jerkins-bot: [V: 04-1] gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:27:48] (03CR) 10Bstorm: [V: 03+2 C: 03+2] dologmsg: ensure the directory exists before trying to add the man page [puppet] - 10https://gerrit.wikimedia.org/r/540188 (https://phabricator.wikimedia.org/T222244) (owner: 10Bstorm) [18:28:23] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:28:31] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:29:19] marxarelli: I was going to no-op sync [18:29:55] only affects beta [18:30:13] (03PS4) 10Dzahn: gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) [18:30:46] (03PS1) 10Pmiazga: Do not set 
wgMFNoindexPages config flag in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540193 (https://phabricator.wikimedia.org/T206497) [18:31:27] (03CR) 10Dzahn: [C: 03+2] "only affects non-prod server" [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:33:08] (03CR) 10jerkins-bot: [V: 04-1] gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:33:27] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Running swiftrepl is not puppetized - https://phabricator.wikimedia.org/T162123 (10fgiunchedi) >>! In T162123#5538678, @gerritbot wrote: > Change 539535 **merged** by Filippo Giunchedi: > [operations/puppet@production] swift: open per-po... [18:37:03] (03PS5) 10Dzahn: gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) [18:37:44] (03PS1) 10Andrew Bogott: codfw1-dev ldap: fix novaobserver ACLs [puppet] - 10https://gerrit.wikimedia.org/r/540195 [18:38:50] (03CR) 10Andrew Bogott: [C: 03+2] codfw1-dev ldap: fix novaobserver ACLs [puppet] - 10https://gerrit.wikimedia.org/r/540195 (owner: 10Andrew Bogott) [18:39:30] (03CR) 10jerkins-bot: [V: 04-1] gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:39:37] (03CR) 10Bstorm: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [18:42:10] (03PS1) 10Filippo Giunchedi: hieradata: enable swift servers per port for new codfw hardware [puppet] - 10https://gerrit.wikimedia.org/r/540196 (https://phabricator.wikimedia.org/T222366) [18:44:23] 10Operations, 10CX-cxserver, 10Citoid, 10Core Platform Team, and 9 others: Make services 
swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10CCicalese_WMF) [18:46:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/18701/" [puppet] - 10https://gerrit.wikimedia.org/r/540196 (https://phabricator.wikimedia.org/T222366) (owner: 10Filippo Giunchedi) [18:47:10] 10Operations, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, 10Core Platform Team Legacy (Watching / External), and 3 others: FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453 (10Pchelolo) [18:48:09] (03PS6) 10Dzahn: gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) [18:48:39] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 58%, RTA = 2204.94 ms [18:48:43] !log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.16 (duration: 18m 45s) [18:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:15] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1050 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:49:27] PROBLEM - Check systemd state on ms-be1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:07] (03PS1) 10Andrew Bogott: validatelabsfqdn.py: allow codfw1dev and future eqiad VM fqdns [puppet] - 10https://gerrit.wikimedia.org/r/540197 [18:50:09] (03PS3) 10CRusnov: interface_automation: fix minor issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540167 [18:51:10] (03CR) 10Paladox: [C: 03+1] "LGTM and also you managed to future proof this too!" 
[puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:51:35] (03PS2) 10Andrew Bogott: validatelabsfqdn.py: allow codfw1dev and future eqiad VM fqdns [puppet] - 10https://gerrit.wikimedia.org/r/540197 [18:52:41] (03CR) 10Dzahn: "yea, future proof was a side effect of making style check happy. the system works :)" [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:52:52] (03CR) 10Thcipriani: [C: 03+1] Fix phatality deployment script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540117 (owner: 1020after4) [18:52:54] (03CR) 10Andrew Bogott: [C: 03+2] validatelabsfqdn.py: allow codfw1dev and future eqiad VM fqdns [puppet] - 10https://gerrit.wikimedia.org/r/540197 (owner: 10Andrew Bogott) [18:53:37] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10sbassett) Re:** CheckUser**, there was a recent security patch (T207094, [[ https://gerrit.wikimedia.org/r/539643 | backport to master ]]) which //did// suffer from some initial performance issues. These is... 
[18:54:19] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.97 ms [18:54:40] !log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.17 (duration: 02m 48s) [18:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:16] 10Operations, 10observability, 10Patch-For-Review: Hosts in puppet with $cluster missing from wikimedia_clusters - https://phabricator.wikimedia.org/T234232 (10fgiunchedi) [18:57:20] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18702/" [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [18:57:29] !log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.19 (duration: 02m 12s) [18:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:03] (03PS7) 10Dzahn: gerrit1001: allow rsyncing Gerrit data from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/540190 (https://phabricator.wikimedia.org/T231046) [18:59:52] !log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.20 (duration: 02m 11s) [18:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] marxarelli: #bothumor I � Unicode. All rise for MediaWiki train - American version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191001T1900). 
[19:00:45] !log ayounsi@cumin1001 START - Cookbook sre.hosts.rotate-pdu-password [19:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:27] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.rotate-pdu-password (exit_code=1) [19:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:02] !log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.21 (duration: 01m 57s) [19:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:53] !log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.22 (duration: 01m 41s) [19:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:33] !log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.23 (duration: 01m 32s) [19:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:57] !log dduvall@deploy1001 Started scap: testwiki to php-1.34.0-wmf.25 and rebuild l10n cache [19:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:51] mutante \o/ [19:10:28] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10Nuria) Is @herron going to correct the patch per @MoritzMuehlenhoff guidelines? @EYener: You should ha... 
[19:15:07] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1050 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:15:19] RECOVERY - Check systemd state on ms-be1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:48] (03CR) 10CDanis: [C: 03+1] hieradata: enable swift servers per port for new codfw hardware [puppet] - 10https://gerrit.wikimedia.org/r/540196 (https://phabricator.wikimedia.org/T222366) (owner: 10Filippo Giunchedi) [19:27:43] (03PS1) 10Dzahn: gerrit::migration: ensure data dir exists and is configurable [puppet] - 10https://gerrit.wikimedia.org/r/540203 (https://phabricator.wikimedia.org/T231046) [19:28:28] !log dduvall@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.25 and rebuild l10n cache (duration: 19m 31s) [19:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:55] (03CR) 10jerkins-bot: [V: 04-1] gerrit::migration: ensure data dir exists and is configurable [puppet] - 10https://gerrit.wikimedia.org/r/540203 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [19:30:21] !log promoting 1.34.0-wmf.25 to group0 [19:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:25] (03CR) 10Dduvall: [C: 03+2] Group0 to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540191 (owner: 10Dduvall) [19:32:18] (03PS2) 10Dzahn: gerrit::migration: ensure data dir exists and is configurable [puppet] - 10https://gerrit.wikimedia.org/r/540203 (https://phabricator.wikimedia.org/T231046) [19:32:24] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540191 (owner: 10Dduvall) [19:34:45] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.25 [19:34:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:20] 10Operations, 10Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (10Varnent) It sounds like this is similar to the request for WMFall - basically maintain the previous archive and redirect the email to the new list. I think that has to be done on the DevOps side... [19:37:31] (03CR) 10Paladox: [C: 03+1] gerrit::migration: ensure data dir exists and is configurable [puppet] - 10https://gerrit.wikimedia.org/r/540203 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [19:37:50] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/18703/gerrit1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/540203 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [19:38:16] (03PS3) 10Dzahn: gerrit::migration: ensure data dir exists and is configurable [puppet] - 10https://gerrit.wikimedia.org/r/540203 (https://phabricator.wikimedia.org/T231046) [19:40:56] (03CR) 10Dzahn: [C: 03+2] gerrit::migration: ensure data dir exists and is configurable [puppet] - 10https://gerrit.wikimedia.org/r/540203 (https://phabricator.wikimedia.org/T231046) (owner: 10Dzahn) [19:42:55] !log 1.34.0-wmf.25 promoted to group0 cc: T220750. 
no rise in relevant error rates [19:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:00] T220750: 1.34.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T220750 [19:43:46] (03PS1) 10Ayounsi: PDU password change, add session management [cookbooks] - 10https://gerrit.wikimedia.org/r/540206 (https://phabricator.wikimedia.org/T233053) [19:46:09] (03CR) 10Ayounsi: [C: 03+2] PDU password change, add session management [cookbooks] - 10https://gerrit.wikimedia.org/r/540206 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [19:47:05] !log ayounsi@cumin1001 START - Cookbook sre.hosts.rotate-pdu-password [19:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:39] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.rotate-pdu-password (exit_code=97) [19:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:20] Krinkle: Prod is yours if you want to deploy that Vector patch. [19:53:09] (03PS1) 10Herron: admin: add expiry_date and expiry_contact for user eyener [puppet] - 10https://gerrit.wikimedia.org/r/540207 (https://phabricator.wikimedia.org/T233636) [19:54:54] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) >>! In T233636#5536946, @MoritzMuehlenhoff wrote: > There's two issues wi... 
[19:58:02] !log cobalt (gerrit) - rsyncing gerrit data to gerrit1001 in a screen session (T222391) [19:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:06] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [19:58:06] (03PS3) 10EBernhardson: [cirrus] glent method 0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537637 (https://phabricator.wikimedia.org/T233211) (owner: 10DCausse) [20:10:58] !log ayounsi@cumin1001 START - Cookbook sre.hosts.rotate-pdu-password [20:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:30] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.rotate-pdu-password (exit_code=1) [20:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:42] James_F: thx, will do a bit later. [20:31:02] 10Operations, 10ops-eqiad: Verify switch port connections - https://phabricator.wikimedia.org/T233302 (10Jclark-ctr) [20:33:52] (03PS2) 10Brennen Bearnes: mediawiki-dev: port 8080; apache entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) [20:40:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] Set correct appbase url port [deployment-charts] - 10https://gerrit.wikimedia.org/r/539641 (owner: 10Jeena Huneidi) [20:41:31] (03CR) 10Brennen Bearnes: mediawiki-dev: port 8080; apache entrypoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: 10Brennen Bearnes) [20:42:23] (03CR) 10Brennen Bearnes: [C: 03+2] Set correct appbase url port [deployment-charts] - 10https://gerrit.wikimedia.org/r/539641 (owner: 10Jeena Huneidi) [20:45:15] (03PS10) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [20:46:06] (03PS11) 10Paladox: gerrit: add 
role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [20:46:14] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [20:52:20] (03CR) 10Filippo Giunchedi: logstash: throttle duplicate normalized_message with level:ERR* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [20:53:26] (03PS2) 10Filippo Giunchedi: hieradata: enable swift servers per port for new codfw hardware [puppet] - 10https://gerrit.wikimedia.org/r/540196 (https://phabricator.wikimedia.org/T222366) [20:55:03] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable swift servers per port for new codfw hardware [puppet] - 10https://gerrit.wikimedia.org/r/540196 (https://phabricator.wikimedia.org/T222366) (owner: 10Filippo Giunchedi) [20:57:14] (03CR) 10Alexandros Kosiaris: "Hm, interesting, never thought of it. Do you have a reproduction handy? 
I 'd like to see it in action and understand it a bit better" [deployment-charts] - 10https://gerrit.wikimedia.org/r/539220 (owner: 10Jeena Huneidi) [21:01:11] (03PS1) 10Filippo Giunchedi: hieradata: add ms-be205[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/540212 (https://phabricator.wikimedia.org/T233638) [21:05:05] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Jclark-ctr) [21:06:06] (03CR) 10Ayounsi: [C: 03+1] interface_automation: fix minor issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540167 (owner: 10CRusnov) [21:06:09] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add ms-be205[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/540212 (https://phabricator.wikimedia.org/T233638) (owner: 10Filippo Giunchedi) [21:06:35] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson Host wiped, netbox updated status set offline removed from rack, and added to hardware tracking sheet [21:07:23] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Jclark-ctr) [21:08:25] (03Abandoned) 10Jforrester: 50% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535770 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [21:09:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission lithium - https://phabricator.wikimedia.org/T229557 (10Jclark-ctr) [21:09:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission lithium - https://phabricator.wikimedia.org/T229557 (10Jclark-ctr) 05Open→03Resolved a:05Jclark-ctr→03Cmjohnson Host wiped, netbox updated status set offline removed from rack, and added to hardware tracking sheet [21:12:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission phab1002/WMF4727 - 
https://phabricator.wikimedia.org/T221391 (10Jclark-ctr) [21:13:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission phab1002/WMF4727 - https://phabricator.wikimedia.org/T221391 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson performed disk wipe changed hostname back to asset tag [21:14:37] (03PS1) 10Filippo Giunchedi: codfw-prod: add ms-be2051, minimal weight and servers_per_port [software/swift-ring] - 10https://gerrit.wikimedia.org/r/540213 (https://phabricator.wikimedia.org/T233638) [21:17:02] 10Operations, 10ORES, 10serviceops, 10Patch-For-Review, 10Scoring-platform-team (Current): celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10akosiaris) 05Open→03Resolved I 'll resolve this, all workers hav... [21:19:19] (03PS3) 10Jeena Huneidi: Set correct appbase url port [deployment-charts] - 10https://gerrit.wikimedia.org/r/539641 [21:22:52] (03PS1) 10Jhedden: openstack: increase apr cache table for cloudnet hosts [puppet] - 10https://gerrit.wikimedia.org/r/540216 (https://phabricator.wikimedia.org/T234373) [21:23:16] (03CR) 10CDanis: [C: 03+1] codfw-prod: add ms-be2051, minimal weight and servers_per_port [software/swift-ring] - 10https://gerrit.wikimedia.org/r/540213 (https://phabricator.wikimedia.org/T233638) (owner: 10Filippo Giunchedi) [21:23:45] (03CR) 10Brennen Bearnes: [C: 03+2] Set correct appbase url port [deployment-charts] - 10https://gerrit.wikimedia.org/r/539641 (owner: 10Jeena Huneidi) [21:23:57] (03Merged) 10jenkins-bot: Set correct appbase url port [deployment-charts] - 10https://gerrit.wikimedia.org/r/539641 (owner: 10Jeena Huneidi) [21:24:18] (03PS2) 10Jhedden: openstack: increase arp cache table for cloudnet hosts [puppet] - 10https://gerrit.wikimedia.org/r/540216 (https://phabricator.wikimedia.org/T234373) [21:24:52] (03PS1) 10Dzahn: gerrit::migration: ensure user exists and sync has right owners [puppet] - 
10https://gerrit.wikimedia.org/r/540217 [21:25:16] 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10wiki_willy) Hi @Vgutierrez - just following up on this to see if there was an ETA, since these are supposed to replace lvs2001-2006...which are all past their 5yr mark, and have the following... [21:25:59] (03CR) 10Filippo Giunchedi: [C: 03+2] codfw-prod: add ms-be2051, minimal weight and servers_per_port [software/swift-ring] - 10https://gerrit.wikimedia.org/r/540213 (https://phabricator.wikimedia.org/T233638) (owner: 10Filippo Giunchedi) [21:26:01] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] codfw-prod: add ms-be2051, minimal weight and servers_per_port [software/swift-ring] - 10https://gerrit.wikimedia.org/r/540213 (https://phabricator.wikimedia.org/T233638) (owner: 10Filippo Giunchedi) [21:28:07] 10Operations, 10ops-eqiad: rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10Jclark-ctr) [21:29:10] 10Operations, 10ops-eqiad: rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10Jclark-ctr) racked and cabled host updated netbox [21:29:20] !log ayounsi@cumin1001 START - Cookbook sre.hosts.rotate-pdu-password [21:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:39] (03CR) 10Andrew Bogott: openstack: increase arp cache table for cloudnet hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540216 (https://phabricator.wikimedia.org/T234373) (owner: 10Jhedden) [21:29:48] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.rotate-pdu-password (exit_code=1) [21:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:35] (03CR) 10Jhedden: openstack: increase arp cache table for cloudnet hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540216 (https://phabricator.wikimedia.org/T234373) (owner: 10Jhedden) [21:33:55] !log 
krinkle@deploy1001 Synchronized php-1.34.0-wmf.25/skins/Vector/: bb2fd9cf9c22cc (duration: 01m 00s) [21:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:40] !log swift codfw-prod: add ms-be2051 with minimal weight - T233638 T222366 [21:34:43] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for db2040 [dns] - 10https://gerrit.wikimedia.org/r/539996 (owner: 10Papaul) [21:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:44] T222366: Test swift object server deployment with one disk per tcp port - https://phabricator.wikimedia.org/T222366 [21:34:45] T233638: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 [21:37:13] (03PS3) 10Jhedden: openstack: increase arp cache table for cloudnet hosts [puppet] - 10https://gerrit.wikimedia.org/r/540216 (https://phabricator.wikimedia.org/T234373) [21:37:51] (03PS3) 10Dzahn: DNS: Remove mgmt DNS for db2040 [dns] - 10https://gerrit.wikimedia.org/r/539996 (owner: 10Papaul) [21:38:33] (03CR) 10Andrew Bogott: [C: 03+1] openstack: increase arp cache table for cloudnet hosts [puppet] - 10https://gerrit.wikimedia.org/r/540216 (https://phabricator.wikimedia.org/T234373) (owner: 10Jhedden) [21:42:36] (03CR) 10CRusnov: [C: 03+2] interface_automation: fix minor issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/540167 (owner: 10CRusnov) [21:42:53] (03CR) 10Jhedden: [C: 03+2] openstack: increase arp cache table for cloudnet hosts [puppet] - 10https://gerrit.wikimedia.org/r/540216 (https://phabricator.wikimedia.org/T234373) (owner: 10Jhedden) [21:44:29] (03PS2) 10Dzahn: gerrit::migration: ensure user exists and sync has right owners [puppet] - 10https://gerrit.wikimedia.org/r/540217 [21:44:34] (03PS4) 10Papaul: DNS: Remove mgmt DNS for db2040 [dns] - 10https://gerrit.wikimedia.org/r/539996 [21:44:56] (03CR) 10Dzahn: [C: 03+2] gerrit::migration: ensure user exists and sync has right owners [puppet] - 
10https://gerrit.wikimedia.org/r/540217 (owner: 10Dzahn) [21:45:09] 10Operations, 10serviceops, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10Jdforrester-WMF) [21:46:17] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for db2040 [dns] - 10https://gerrit.wikimedia.org/r/539996 (owner: 10Papaul) [21:47:21] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2040 [dns] - 10https://gerrit.wikimedia.org/r/539996 (owner: 10Papaul) [21:48:33] (03CR) 10Paladox: [C: 03+1] gerrit::migration: ensure user exists and sync has right owners [puppet] - 10https://gerrit.wikimedia.org/r/540217 (owner: 10Dzahn) [21:49:01] PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100% [21:50:13] RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms [21:50:25] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (10Papaul) [21:50:44] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (10Papaul) 05Open→03Resolved complete [22:01:25] (03CR) 10Thcipriani: "hrm. This seems to be already installed on cobalt:" [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar) [22:08:00] (03PS3) 10Dzahn: gerrit: install openjdk dbg package [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar) [22:10:12] (03CR) 10Dzahn: [C: 03+2] "indeed it is already installed on cobalt. 
and https://puppet-compiler.wmflabs.org/compiler1002/18704/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar) [22:13:04] (03PS1) 10Ayounsi: Rotate PDU password, workaround requests lib bug [cookbooks] - 10https://gerrit.wikimedia.org/r/540228 (https://phabricator.wikimedia.org/T233053) [22:14:03] (03CR) 10Dzahn: "noop on cobalt. Package[openjdk-8-dbg]/ensure: created on gerrit2001." [puppet] - 10https://gerrit.wikimedia.org/r/540094 (https://phabricator.wikimedia.org/T231872) (owner: 10Hashar) [22:14:10] (03CR) 10CRusnov: [C: 03+1] "LGTM hope it works :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/540228 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [22:15:31] (03CR) 10jerkins-bot: [V: 04-1] Rotate PDU password, workaround requests lib bug [cookbooks] - 10https://gerrit.wikimedia.org/r/540228 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [22:15:43] (03PS2) 10Ayounsi: Rotate PDU password, workaround requests lib bug [cookbooks] - 10https://gerrit.wikimedia.org/r/540228 (https://phabricator.wikimedia.org/T233053) [22:17:51] (03CR) 10Ayounsi: [C: 03+2] Rotate PDU password, workaround requests lib bug [cookbooks] - 10https://gerrit.wikimedia.org/r/540228 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [22:17:55] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2041 db204[3-4] db204[6-7] and db2049 [dns] - 10https://gerrit.wikimedia.org/r/540230 [22:19:01] !log ayounsi@cumin1001 START - Cookbook sre.hosts.rotate-pdu-password [22:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:52] (03PS1) 10Bartosz Dziewoński: Add 'VisualEditor' logging channel to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540235 (https://phabricator.wikimedia.org/T233127) [22:23:29] jouncebot: next [22:23:29] In 0 hour(s) and 36 minute(s): Evening SWAT (Max 6 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191001T2300) [22:25:47] is anyone around to quickly review some patches related to logging? i'm not sure if i know what i'm doing. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540234/1/includes/ApiVisualEditor.php https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/540235/1/wmf-config/InitialiseSettings.php [22:25:59] is that ^ all i have to do to see the results on logstash.wikimedia.org? [22:26:34] (03PS1) 10Dzahn: gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 [22:27:51] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for db2041 db204[3-4] db204[6-7] and db2049 [dns] - 10https://gerrit.wikimedia.org/r/540230 (owner: 10Papaul) [22:28:49] (03CR) 10jerkins-bot: [V: 04-1] gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 (owner: 10Dzahn) [22:29:59] PROBLEM - puppet last run on cloudnet2003-dev is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:30:22] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.rotate-pdu-password (exit_code=1) [22:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:18] (03PS2) 10Dzahn: gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 [22:32:10] (03CR) 10jerkins-bot: [V: 04-1] gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 (owner: 10Dzahn) [22:32:43] (03PS1) 10Bstorm: dumps distribution: fail labstore1007 back as VPS NFS [puppet] - 10https://gerrit.wikimedia.org/r/540238 [22:36:08] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2041 db204[3-4] db204[6-7] and db2049 [dns] - 10https://gerrit.wikimedia.org/r/540230 (owner: 10Papaul) [22:38:11] (03PS3) 10Dzahn: 
gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 [22:38:54] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (10Papaul) [22:39:10] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (10Papaul) Complete [22:39:17] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (10Papaul) 05Open→03Resolved [22:39:37] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10Papaul) [22:39:44] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [22:39:46] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10Papaul) 05Open→03Resolved Complete [22:40:05] (03PS4) 10Dzahn: gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 (https://phabricator.wikimedia.org/T222391) [22:42:14] 10Operations, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Dzahn) [22:43:58] 10Operations, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Dzahn) There were/are a bunch of changes in a common topic branch not linked to this ticket that were needed: https://gerrit.wikimedia.org/r/q/topic:%22gerrit1001%22+(status:open%20OR%20status:merged) I just close... 
[22:44:25] 10Operations, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Dzahn) 05Open→03Resolved [22:44:35] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [22:46:05] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) p:05Normal→03High [22:46:40] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Papaul) [22:46:51] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Papaul) 05Open→03Resolved Complete [22:46:54] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [22:49:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18707/gerrit1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/540237 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [22:50:52] (03PS5) 10Dzahn: gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 (https://phabricator.wikimedia.org/T222391) [22:50:54] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Papaul) [22:51:09] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Papaul) 05Open→03Resolved Complete [22:51:11] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [22:52:38] (03PS1) 10Filippo 
Giunchedi: icinga: add missing format string type [puppet] - 10https://gerrit.wikimedia.org/r/540239 [22:52:40] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10Papaul) [22:52:49] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [22:52:51] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10Papaul) 05Open→03Resolved Complete [22:53:33] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Papaul) [22:54:21] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: add missing format string type [puppet] - 10https://gerrit.wikimedia.org/r/540239 (owner: 10Filippo Giunchedi) [22:56:19] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 6 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [22:56:43] (03PS1) 10Dzahn: gerrit::migration: let gerrit-root users ssh to new gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/540240 (https://phabricator.wikimedia.org/T222391) [22:57:20] thcipriani ^^ :) [22:57:23] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Papaul) 05Open→03Resolved [22:57:27] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [22:57:32] (03PS2) 10Dzahn: gerrit::migration: let gerrit-root users ssh to new gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/540240 (https://phabricator.wikimedia.org/T222391) [22:59:10] (03PS3) 10Dzahn: gerrit::migration: let gerrit-root users ssh to new gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/540240 
(https://phabricator.wikimedia.org/T222391) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191001T2300). [23:00:04] ebernhardson and MatmaRex: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] (03PS2) 10Filippo Giunchedi: icinga: add missing format string type [puppet] - 10https://gerrit.wikimedia.org/r/540239 [23:00:21] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] icinga: add missing format string type [puppet] - 10https://gerrit.wikimedia.org/r/540239 (owner: 10Filippo Giunchedi) [23:00:33] (03CR) 10Paladox: [C: 03+1] gerrit::migration: let gerrit-root users ssh to new gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/540240 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:00:45] \o [23:00:47] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 6 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [23:01:23] hi [23:01:41] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 6 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) >>! In T120085#5345448, @BBlack wrote: > I like the end result here, and I don't think it's problematic... 
[23:02:43] 10Operations, 10Core Platform Team, 10Performance-Team, 10Readers-Web-Backlog, and 7 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [23:02:46] (03CR) 10Dzahn: "@reviewers: fyi, you would have received this anyways in the moment we apply the prod role but this way you can ssh to the new server now " [puppet] - 10https://gerrit.wikimedia.org/r/540240 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:03:01] (03CR) 10Dzahn: [C: 03+2] gerrit::migration: let gerrit-root users ssh to new gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/540240 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:03:12] (03PS4) 10Dzahn: gerrit::migration: let gerrit-root users ssh to new gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/540240 (https://phabricator.wikimedia.org/T222391) [23:05:39] 10Operations, 10Core Platform Team, 10Performance-Team, 10Readers-Web-Backlog, and 7 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) >>! From **task description** > Stakeholders: > > - Traffic team. (assert potential routing i... [23:05:44] (03PS6) 10Dzahn: gerrit::migration: add firewall hole for rsync over IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/540237 (https://phabricator.wikimedia.org/T222391) [23:06:29] hmm, is anyone doing the SWAT? [23:07:20] i guess i can ship things [23:07:28] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) @thcipriani You and the other members of gerrit-roots admin group can now ssh to gerrit1001.wikimed... 
[23:07:54] (03CR) 10EBernhardson: [C: 03+2] Add 'VisualEditor' logging channel to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540235 (https://phabricator.wikimedia.org/T233127) (owner: 10Bartosz Dziewoński) [23:08:47] (03Merged) 10jenkins-bot: Add 'VisualEditor' logging channel to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540235 (https://phabricator.wikimedia.org/T233127) (owner: 10Bartosz Dziewoński) [23:09:00] ebernhardson: can you have a quick look at those two patches and see if they make sense? i'm not really familiar with the logging stuff [23:09:21] (and thank you for deploying) [23:10:33] (03PS1) 10Dzahn: gerrit::migration: remove superfluous ipresolve line [puppet] - 10https://gerrit.wikimedia.org/r/540244 [23:11:10] MatmaRex: seems plausible enough [23:11:12] (03CR) 10jerkins-bot: [V: 04-1] gerrit::migration: remove superfluous ipresolve line [puppet] - 10https://gerrit.wikimedia.org/r/540244 (owner: 10Dzahn) [23:11:57] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10thcipriani) >>! In T222391#5539841, @Dzahn wrote: > @thcipriani You and the other members of gerrit-roots... 
[23:12:39] (03CR) 10EBernhardson: [C: 03+2] [cirrus] glent method 0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537637 (https://phabricator.wikimedia.org/T233211) (owner: 10DCausse) [23:12:54] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T233127: Add VisualEditor logging channel to wmgMonologChannels (duration: 00m 59s) [23:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:58] T233127: HTTP 404 error in VE possibly when confronted with an edit conflict - https://phabricator.wikimedia.org/T233127 [23:16:22] (03PS4) 10Filippo Giunchedi: profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) [23:17:36] (03PS4) 10EBernhardson: [cirrus] glent method 0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537637 (https://phabricator.wikimedia.org/T233211) (owner: 10DCausse) [23:17:45] (03CR) 10EBernhardson: [C: 03+2] [cirrus] glent method 0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537637 (https://phabricator.wikimedia.org/T233211) (owner: 10DCausse) [23:18:39] (03Merged) 10jenkins-bot: [cirrus] glent method 0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537637 (https://phabricator.wikimedia.org/T233211) (owner: 10DCausse) [23:18:41] (03PS2) 10Dzahn: gerrit::migration: add base::firewall, rm superfluous ipresolve line [puppet] - 10https://gerrit.wikimedia.org/r/540244 (https://phabricator.wikimedia.org/T222391) [23:18:43] (03CR) 10Filippo Giunchedi: "Thanks for the reviews!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [23:18:46] (03CR) 10jerkins-bot: [V: 04-1] profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [23:19:00] 10Operations, 10Core Platform Team, 10Performance-Team, 10Readers-Web-Backlog, and 7 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Johan) How would you phrase this for inclusion in Tech News? [23:19:58] RECOVERY - puppet last run on cloudnet2003-dev is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:20:18] (03CR) 10Filippo Giunchedi: "> Patch Set 4: Verified-1" [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [23:20:25] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T233211: CirrusSearch: Configuration for glent m0 AB test (duration: 00m 58s) [23:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:29] T233211: Deploy A/B Test for Glent Method 0 - https://phabricator.wikimedia.org/T233211 [23:21:28] !log gerrit1001 - chown -R gerrit2:gerrit2 /srv/gerrit/git/ (T222391) [23:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:31] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [23:22:28] (03PS1) 10Filippo Giunchedi: DNM Revert "hieradata: add acmechief cluster" [puppet] - 10https://gerrit.wikimedia.org/r/540246 [23:24:19] (03CR) 10Paladox: [C: 03+1] gerrit::migration: add base::firewall, rm superfluous ipresolve line [puppet] - 10https://gerrit.wikimedia.org/r/540244 
(https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:24:26] (03CR) 10Dzahn: [C: 03+2] gerrit::migration: add base::firewall, rm superfluous ipresolve line [puppet] - 10https://gerrit.wikimedia.org/r/540244 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:25:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "Looks like it is working as intended (tested on a dummy revert for acmechief cluster) https://puppet-compiler.wmflabs.org/compiler1001/187" [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [23:27:54] * ebernhardson patiently waits for jenkins [23:28:46] !log cobalt (gerrit) rsyncing /srv/gerrit/plugins dir, push to new server gerrit1001 (T222391) [23:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:50] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [23:31:40] PROBLEM - puppet last run on cloudnet2002-dev is CRITICAL: connect to address 10.192.20.10 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:34:33] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) >>! In T222391#5539856, @thcipriani wrote: > Confirmed that I can ssh in and I can see `/srv/gerrit... 
[23:34:37] yeah, jenkins sure is taking his sweet time [23:35:20] my usual reaction: https://preview.redd.it/s9oewgfokie21.gif?width=600&format=mp4&s=c1e2341d9bddd883daf12653f86ab70f6668331a [23:36:09] more rarely: http://i.imgur.com/ObIUV8L.gif [23:40:50] MatmaRex: you're up on mwdebug1002 [23:40:56] MatmaRex: i dunno if you can trigger that [23:43:10] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: T233211: Deploy cirrussearch glent m0 a/b test (duration: 00m 59s) [23:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:14] T233211: Deploy A/B Test for Glent Method 0 - https://phabricator.wikimedia.org/T233211 [23:44:01] ebernhardson: sorry, i was away for a moment [23:44:12] MatmaRex: no worries, it's been like 40 minutes :) [23:44:12] ebernhardson: no, i can't. we have no idea when or how it's happening :D [23:44:18] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: T233211: Deploy cirrussearch glent m0 a/b test (duration: 00m 59s) [23:44:19] ok will sync out [23:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:23] (03CR) 10Dzahn: "can you really just change this to a list in yaml without changing some puppet or erb code as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [23:45:13] editing pages works on mwdebug1002, so i didn't break anything horribly [23:45:21] (03PS12) 10Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [23:45:30] (03PS5) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 [23:45:39] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [23:45:45] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [23:46:06] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/includes/ApiVisualEditor.php: T233127: ApiVisualEditor: Add logging for RESTBase HTTP errors (duration: 00m 58s) [23:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:10] T233127: HTTP 404 error in VE possibly when confronted with an edit conflict - https://phabricator.wikimedia.org/T233127 [23:46:36] PROBLEM - traffic_server tls process restarted on cp5007 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5007&var-layer=tls [23:46:42] MatmaRex: all deployed [23:46:56] thanks [23:47:57] PROBLEM - puppet last run on cloudnet2002-dev is CRITICAL: connect to address 10.192.20.10 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:49:49] (03CR) 10Paladox: "Yes it should work! 
See puppet compiler https://puppet-compiler.wmflabs.org/compiler1002/288/" [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [23:52:17] (03PS5) 10Dzahn: mariadb::packages: support buster, drop libmariadbclient18 [puppet] - 10https://gerrit.wikimedia.org/r/539973 [23:53:52] (03PS3) 10Bstorm: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) [23:55:07] ebernhardson: all done? [23:56:31] (03PS11) 10Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) [23:57:10] (03CR) 10Bstorm: "Now this definitely will need review. I'll get testing them locally. Important note: In this version the default "prefix" is toolforge. " [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm)