[00:06:14] (PS1) Papaul: Partman: Add cloudceph200[1-3]-dev [puppet] - https://gerrit.wikimedia.org/r/595714 (https://phabricator.wikimedia.org/T250846)
[00:13:03] (CR) Papaul: [C: +2] Partman: Add cloudceph200[1-3]-dev [puppet] - https://gerrit.wikimedia.org/r/595714 (https://phabricator.wikimedia.org/T250846) (owner: Papaul)
[00:18:14] Operations, ops-codfw, Cloud-Services, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudceph2001-dev.wiki...
[00:18:40] (PS1) CRusnov: Upgrade Netbox to v2.8.3-wmf [software/netbox-deploy] - https://gerrit.wikimedia.org/r/595717
[00:19:01] (CR) CRusnov: "This change is ready for review." [software/netbox-deploy] - https://gerrit.wikimedia.org/r/595717 (owner: CRusnov)
[00:33:33] Operations, ops-codfw, Cloud-Services, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudceph2001-dev.wikimedia.org'] ` Of which those **FAILED**: ` ['cl...
[00:34:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:11] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:57] Operations, ops-codfw, Cloud-Services, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (Papaul)
[00:51:25] (PS1) Papaul: Change doamin from codfw.wmnet to wikimedia.org for cloudceph200[1-3]-dev [puppet] - https://gerrit.wikimedia.org/r/595738 (https://phabricator.wikimedia.org/T250846)
[00:52:43] (CR) Papaul: [C: +2] Change doamin from codfw.wmnet to wikimedia.org for cloudceph200[1-3]-dev [puppet] - https://gerrit.wikimedia.org/r/595738 (https://phabricator.wikimedia.org/T250846) (owner: Papaul)
[00:56:59] Operations, ops-codfw, Cloud-Services, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudceph2001-dev.wiki...
[01:14:32] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:59] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:42] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudceph2001-dev.wikimedia.org'] ` and were **ALL** successful.
[01:25:49] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudceph2002-dev.wikimedia.org ` The log ca...
[01:43:26] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:54] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:44] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudceph2002-dev.wikimedia.org'] ` and were **ALL** successful.
[01:53:23] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (Papaul)
[01:54:02] PROBLEM - Host db2097 is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:41] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudceph2003-dev.wikimedia.org ` The log ca...
[01:56:46] RECOVERY - Host db2097 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[01:59:46] PROBLEM - MariaDB read only s6 on db2097 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:59:48] PROBLEM - MariaDB Slave IO: s1 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:00:16] PROBLEM - MariaDB Slave IO: s6 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:00:22] PROBLEM - mysqld processes on db2097 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:00:52] PROBLEM - MariaDB Slave SQL: s1 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:02:06] PROBLEM - MariaDB Slave SQL: s6 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:04:28] PROBLEM - MariaDB read only s1 on db2097 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:10:36] PROBLEM - MariaDB Slave Lag: s6 on db2097 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:12:19] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[02:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:59] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[02:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:22] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudceph2003-dev.wikimedia.org'] ` and were **ALL** successful.
[02:20:56] Report of 503s on https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Wrong_Fabric?_WikiMedia_Foundation_Errors,_May_2020
[02:26:29] PROBLEM - dump of s1 in codfw on db1115 is CRITICAL: Last dump for s1 at codfw (db2097.codfw.wmnet:3311) taken on 2020-05-12 00:00:02 is 129 GB, but previous one was 154 GB, a change of 16.4% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:26:30] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (Papaul)
[02:29:25] Operations, ops-codfw, Cloud-Services, DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (Papaul) Open→Resolved @JHedden this is complete. You just need to add the DNS for the private interface. Thanks.
[04:24:15] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:15] ACKNOWLEDGEMENT - MariaDB Slave IO: s6 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:15] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on db2097 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:15] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:15] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:16] ACKNOWLEDGEMENT - MariaDB read only s1 on db2097 is CRITICAL: Could not connect to localhost:3311 Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:24:16] ACKNOWLEDGEMENT - MariaDB read only s6 on db2097 is CRITICAL: Could not connect to localhost:3316 Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:24:17] ACKNOWLEDGEMENT - mysqld processes on db2097 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui T252492 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:28:43] RECOVERY - MariaDB Slave SQL: s5 on db2123 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:29:22] In 30 minutes I will be restarting s4 master
[04:45:48] (PS2) Andrew Bogott: Openstack: move API traffic to cloudcontrol1005 [dns] - https://gerrit.wikimedia.org/r/595229 (https://phabricator.wikimedia.org/T252121)
[04:45:59] (PS4) Andrew Bogott: OpenStack: move all openstack API support to cloudcontrol1005 [puppet] - https://gerrit.wikimedia.org/r/595227 (https://phabricator.wikimedia.org/T252121)
[04:46:38] !log Stop mysql on labsdb1011 to transfer its content - T249188
[04:46:39] RECOVERY - MariaDB Slave Lag: s5 on db2123 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:42] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:46:49] RECOVERY - MariaDB Slave Lag: s5 on db2128 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:47:19] RECOVERY - MariaDB Slave Lag: s5 on db2099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:47:21] RECOVERY - MariaDB Slave Lag: s5 on db2113 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:47:25] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:47:49] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:47:53] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:48:17] RECOVERY - MariaDB Slave Lag: s5 on db2111 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:48:19] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:51:23] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[04:51:37] haproxy alert is expected ^
[04:52:05] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui labsdb1011 is on maintenance https://wikitech.wikimedia.org/wiki/HAProxy
[05:00:04] marostegui: That opportune time is upon us again. Time for a s4 primary database master restart deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T0500).
[05:00:10] i/
[05:00:17] going to start jynus
[05:00:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s4 as read-only for maintenance T251502', diff saved to https://phabricator.wikimedia.org/P11179 and previous config saved to /var/cache/conftool/dbconfig/20200512-050054-marostegui.json
[05:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:57] ok
[05:00:57] T251502: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502
[05:01:12] ro confirmed
[05:01:22] restarting
[05:02:51] restart done
[05:03:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s4 as read-only=off for maintenance T251502', diff saved to https://phabricator.wikimedia.org/P11180 and previous config saved to /var/cache/conftool/dbconfig/20200512-050339-marostegui.json
[05:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:20] I can edit again
[05:04:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:04:28] same
[05:04:42] Is that it? Maintenance is done?
[05:04:46] yep
[05:04:50] Damn, y'all work fast.
[05:06:28] recentchanges looking good
[05:07:01] Operations, DBA, User-notice: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui) Open→Resolved This has been done. RO starts: 05:00:54 RO stops: 05:03:40
[05:07:04] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[05:07:38] Operations, DBA, User-notice: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui)
[05:07:55] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[05:08:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:10:07] Operations, SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (Marostegui) p:Triage→Medium
[05:19:11] Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (Marostegui) p:Triage→Medium a:MoritzMuehlenhoff Assigning it to Mortiz for now as he's the one currently working on it
[05:20:42] Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (Marostegui) Open→Resolved p:Triage→Medium a:MoritzMuehlenhoff Closing per: T252382#6124358 If we want a further discussion on long-term solving, we can always create a new task....
[05:22:05] Operations, serviceops: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (Marostegui) p:Triage→Medium
[05:23:16] Operations, Security-Team, serviceops, vm-requests, PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (Marostegui) p:Triage→Medium
[05:27:15] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:29:10] !log Restart docker-report-releng on deneb
[05:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:21] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:38] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (Marostegui) @elukey looks like kafka-jumbo1007 is failing to execute any of the NREP commands, while, for instance kafka-jumbo1008 or 1009 are all green. I...
[05:42:56] Operations, Security-Team, serviceops, vm-requests, PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (Dzahn) a:Dzahn
[05:46:25] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (elukey) Thanks for the ping! Restarted the nagios server on the host and forced a recheck from icinga, let's see if it works.
[06:08:35] (PS1) Elukey: Remove mc1036/mc2036 from the Redis Nutcracker config [puppet] - https://gerrit.wikimedia.org/r/595810 (https://phabricator.wikimedia.org/T252391)
[06:11:10] Operations, Goal: SRE firefighting improvements - 2019-20 Q1 Goal - https://phabricator.wikimedia.org/T229782 (Marostegui) Is this task still valid?
[06:11:58] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey)
[06:33:29] (CR) Vgutierrez: [C: +2] Release 8.0.7-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/595533 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[06:34:44] Operations: cron-spam to root@: lsof stderr generates large emails on boron from wmf-auto-restart - https://phabricator.wikimedia.org/T224661 (Marostegui) Open→Resolved a:Volans boron is gone - marking this as resolved
[06:34:49] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Marostegui)
[06:37:37] Operations, observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (Marostegui)
[06:37:49] Operations, observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (Marostegui) p:Triage→Medium
[06:40:52] Operations, observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (Marostegui)
[06:40:54] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Marostegui)
[06:43:29] Operations, serviceops: cronspam for slow queries in PageAssessments - https://phabricator.wikimedia.org/T197564 (Marostegui) Open→Resolved Closing this as the last email is from 4th Feb 2019
[06:43:33] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Marostegui)
[06:45:11] Operations, observability: exim paniclog on $HOST has non-zero size - https://phabricator.wikimedia.org/T224399 (Marostegui) This keeps happening quite often for different hosts - example : ` exim paniclog on mx1001.wikimedia.org has non-zero size 2020-05-12 00:12:54 1jYIXM-0000YU-Cv spam acl condition:...
[06:46:16] Operations, SRE-swift-storage, User-fgiunchedi: swift-recon-cron on ms-be203[34]: [Errno 17] File exists: '/var/lock/swift-recon-object-cron' - https://phabricator.wikimedia.org/T174959 (Marostegui) Open→Resolved Closing this as fixed as the last error is from Mon, Dec 16, 2019
[06:46:19] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Marostegui)
[06:47:40] (PS1) Elukey: Add bash shabang to all bin scripts [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/595859 (https://phabricator.wikimedia.org/T250161)
[06:49:18] (CR) Dzahn: "How did it add that requirement? I don't see a use of the apache module in this role?" [puppet] - https://gerrit.wikimedia.org/r/595650 (owner: Andrew Bogott)
[06:50:38] (CR) Dzahn: "it seems Brooke already did the same thing and it doesn't look wrong to me but i am still wondering how these puppetmasters got apache bef" [puppet] - https://gerrit.wikimedia.org/r/595650 (owner: Andrew Bogott)
[06:54:06] (CR) Dzahn: "either way, sorry for breaking it and Brooke already did the fix, i think this is duplicate now." [puppet] - https://gerrit.wikimedia.org/r/595650 (owner: Andrew Bogott)
[06:56:54] !log upload trafficserver 8.0.7-1wm4 to apt.wm.o (buster) - T242767 T249335
[06:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:59] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335
[06:56:59] T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767
[06:57:08] Operations, Goal: SRE firefighting improvements - 2019-20 Q1 Goal - https://phabricator.wikimedia.org/T229782 (ayounsi) Open→Resolved a:ayounsi I'd say yes. 1/ and 2/ are done. VictorOps seems to be a good replacement of the [stretch] as it's possible to page people directly even if the infra...
[06:57:37] (CR) Dzahn: "Thanks for the fix. I am not sure how this worked before without using the apache module, I grepped for all things using it." [puppet] - https://gerrit.wikimedia.org/r/595701 (owner: Bstorm)
[06:58:07] (CR) Dzahn: contint: fix git cloning of docroot for integration.wm.org (2 comments) [puppet] - https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[06:58:42] (PS4) Dzahn: contint: fix git cloning of docroot for integration.wm.org [puppet] - https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591)
[06:59:27] (CR) Dzahn: "Evaluation Error: Unknown variable: 'java_version'" [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:01:55] (CR) Muehlenhoff: "Nothing else pulls in Java there, so I don't see why alternatives are even needed? If you only install Java 8, you only get Java 8." [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:04:33] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:25] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (elukey) Looks good now, removed also the downtime/acks!
[07:12:01] (PS4) Dzahn: Initially use Java 8 for contint on Buster [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:12:27] (CR) jerkins-bot: [V: -1] Initially use Java 8 for contint on Buster [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:12:48] !log rebooting the IDP hosts, SSO sessions will need to be renewed
[07:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:05] (PS5) Dzahn: Initially use Java 8 for contint on Buster [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:16:11] Operations, Continuous-Integration-Infrastructure, Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (ema) >>! In T252131#6116691, @Reedy wrote: > It was definitely updated at least twice today In o...
[07:18:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:20:56] (CR) Gehel: [C: +2] sre.wdqs.data-transfer: fix syntax, simplify rule [cookbooks] - https://gerrit.wikimedia.org/r/595061 (https://phabricator.wikimedia.org/T206951) (owner: Ryan Kemper)
[07:22:15] (CR) Dzahn: "Duplicate declaration: Package[openjdk-8-jdk] is already declared at (file: /srv/jenkins-workspace/puppet-compiler/22455/change/src/module" [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:25:12] (PS6) Dzahn: Initially use Java 8 for contint on Buster [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:26:40] (CR) Dzahn: "compiling on "C:jenkins" seems fine but only picks one host of 2 per role. in this case when they have different OS version this means it " [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:27:59] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/22457/" [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:29:39] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:31] Operations, Commons, MediaWiki-File-management, Thumbor, Traffic: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (ema) Request/response details, might be useful to help diagnosing the issue: ` ** << BeReq >> 193204670 -- B...
[07:30:35] (PS7) Dzahn: Initially use Java 8 for contint on Buster [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:31:04] Operations, MediaWiki-Cache, Traffic, serviceops, and 4 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (Joe) This change was released to production to all wikis yesterday. The effect can be seen in this 12h moving average of purge r...
[07:32:28] Operations, Traffic, Wikimedia-Apache-configuration, Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (ema)
[07:37:09] (CR) Dzahn: [C: +2] "amended so that it compiles without changes on unrelated hosts (not sure why we changed java_path there?) and only changes the buster host" [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:38:35] (PS1) Muehlenhoff: Switch prod IDPs to external Tomcat [puppet] - https://gerrit.wikimedia.org/r/595862 (https://phabricator.wikimedia.org/T233950)
[07:39:59] (CR) Dzahn: "noop on contint1001,releases* and java 8 got installed on contint2001" [puppet] - https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[07:41:05] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:41:06] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:23] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 123.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[07:46:06] Operations, observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (fgiunchedi) cc @colewhite
[07:46:57] !log reboot thanos hosts for kernel upgrade
[07:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:56] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[07:51:07] (CR) Gehel: [C: -1] Role for SDoC WDQS (5 comments) [puppet] - https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: EBernhardson)
[07:56:06] (PS1) Dzahn: jenkins: simplify java setup, delete common class [puppet] - https://gerrit.wikimedia.org/r/595866
[07:56:36] (CR) jerkins-bot: [V: -1] jenkins: simplify java setup, delete common class [puppet] - https://gerrit.wikimedia.org/r/595866 (owner: Dzahn)
[07:58:51] (CR) Jbond: [C: +1] Switch prod IDPs to external Tomcat [puppet] - https://gerrit.wikimedia.org/r/595862 (https://phabricator.wikimedia.org/T233950) (owner: Muehlenhoff)
[07:59:47] (CR) Gehel: [C: +1] "LGTM, but I'd like to make sure someone in observability looks at this patch before merging it, just to make sure they are aware." [puppet] - https://gerrit.wikimedia.org/r/595215 (owner: EBernhardson)
[08:00:18] (PS1) Ayounsi: Add blackhole and trusted_space [homer/mock-private] - https://gerrit.wikimedia.org/r/595867
[08:00:33] (CR) jerkins-bot: [V: -1] Add blackhole and trusted_space [homer/mock-private] - https://gerrit.wikimedia.org/r/595867 (owner: Ayounsi)
[08:07:06] (PS2) Ayounsi: Add blackhole and trusted_space [homer/mock-private] - https://gerrit.wikimedia.org/r/595867
[08:07:57] (CR) Ayounsi: [C: +2] Add blackhole and trusted_space [homer/mock-private] - https://gerrit.wikimedia.org/r/595867 (owner: Ayounsi)
[08:08:05] (PS1) JMeybohm: admin: jayme dotfiles: vim config, helm/kubectl completion [puppet] - https://gerrit.wikimedia.org/r/595870
[08:09:48] (PS2) Dzahn: jenkins: simplify java setup, delete common class [puppet] - https://gerrit.wikimedia.org/r/595866
[08:10:49] (CR) JMeybohm: [C: +2] admin: jayme dotfiles: vim config, helm/kubectl completion [puppet] - https://gerrit.wikimedia.org/r/595870 (owner: JMeybohm)
[08:10:51] (CR) jerkins-bot: [V: -1] jenkins: simplify java setup, delete common class [puppet] - https://gerrit.wikimedia.org/r/595866 (owner: Dzahn)
[08:13:11] (CR) Dzahn: "noop in prod per compiler: https://puppet-compiler.wmflabs.org/compiler1003/22459/" [puppet] - https://gerrit.wikimedia.org/r/595866 (owner: Dzahn)
[08:15:00] (CR) Dzahn: "and noop on cloud instances using role::ci::slave::labs::common" [puppet] - https://gerrit.wikimedia.org/r/595866 (owner: Dzahn)
[08:15:52] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 109.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[08:15:56] (PS3) Dzahn: jenkins: simplify java setup, delete common class [puppet] - https://gerrit.wikimedia.org/r/595866
[08:17:35] (PS2) Dzahn: jenkins: add missing /srv/jenkins dir [puppet] - https://gerrit.wikimedia.org/r/595521 (https://phabricator.wikimedia.org/T224591)
[08:19:46] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/22461/" [puppet] - https://gerrit.wikimedia.org/r/595521 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[08:21:51] (PS1) KartikMistry: Update cxserver to 2020-05-11-082207-production [deployment-charts] - https://gerrit.wikimedia.org/r/595872 (https://phabricator.wikimedia.org/T250004)
[08:23:04] (CR) Dzahn: "noop on contint*, created empty dir on releases*" [puppet] - https://gerrit.wikimedia.org/r/595521 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[08:23:27] (PS5) Dzahn: contint: fix git cloning of docroot for integration.wm.org [puppet] - https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591)
[08:24:28] (PS4) Dzahn: remove IPs of recently decom'ed appservers in eqiad D5 [dns] - https://gerrit.wikimedia.org/r/583377 (https://phabricator.wikimedia.org/T247780)
[08:25:37] Operations, Performance-Team, Thumbor, Traffic: thumbor: set Cache-Control on 404 responses that ensures cacheability - https://phabricator.wikimedia.org/T252509 (ema)
[08:26:02] 10Operations, 10Performance-Team, 10Thumbor, 10Traffic: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 (10ema) [08:26:11] 10Operations, 10Performance-Team, 10Thumbor, 10Traffic: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 (10ema) p:05Triage→03High [08:27:25] (03CR) 10Dzahn: [C: 03+2] remove IPs of recently decom'ed appservers in eqiad D5 [dns] - 10https://gerrit.wikimedia.org/r/583377 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [08:27:44] (03CR) 10JMeybohm: parsoid: Add TLS termination support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595505 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [08:28:07] (03Abandoned) 10JMeybohm: parsoid: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/595505 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [08:29:47] (03PS1) 10Ema: Allow caching 404 errors [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595875 (https://phabricator.wikimedia.org/T252509) [08:33:11] (03PS1) 10Dzahn: remove mw1254 - mw1258, they have been decom'ed [dns] - 10https://gerrit.wikimedia.org/r/595876 (https://phabricator.wikimedia.org/T247780) [08:37:19] (03CR) 10Dzahn: [C: 03+2] "these are also in decom state in netbox since a while now" [dns] - 10https://gerrit.wikimedia.org/r/595876 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [08:40:02] (03CR) 10Gilles: [C: 04-1] "That's the Debian package repo, it should instead be changed in operations/software/thumbor-plugins" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595875 (https://phabricator.wikimedia.org/T252509) (owner: 10Ema) [08:40:31] (03PS4) 10Dzahn: Remove jessie support from squid classes [puppet] - 10https://gerrit.wikimedia.org/r/595477 (owner: 10Muehlenhoff) [08:48:55] (03PS1) 10Ema: ATS: cache 404s 
without Cache-Control [puppet] - 10https://gerrit.wikimedia.org/r/595877 [08:51:42] (03Abandoned) 10Ema: Allow caching 404 errors [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595875 (https://phabricator.wikimedia.org/T252509) (owner: 10Ema) [08:54:51] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/22462/" [puppet] - 10https://gerrit.wikimedia.org/r/595477 (owner: 10Muehlenhoff) [08:55:22] (03PS1) 10Ema: Allow caching 404 errors [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595880 (https://phabricator.wikimedia.org/T252509) [08:55:51] (03CR) 10Dzahn: [C: 03+1] "noop in compiler, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/595477 (owner: 10Muehlenhoff) [09:01:27] (03CR) 10Dzahn: [C: 04-1] "my bad, i looked at the $domain fact on gerrit1001 where it is wikimedia.org but on phab1001 with its internal IP this becomes root@eqiad." [puppet] - 10https://gerrit.wikimedia.org/r/595488 (owner: 10Paladox) [09:02:50] (03PS4) 10Paladox: phabricator: Change mail alias only on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/595488 [09:04:22] (03PS1) 10RhinosF1: Insert the description of the change.Add to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595881 [09:06:04] (03CR) 10Dzahn: [C: 03+2] phabricator: Change mail alias only on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/595488 (owner: 10Paladox) [09:06:58] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:06:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:46] !log rebooting contint2001 for kernel update [09:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:48] (03CR) 10Dzahn: [C: 03+2] phabricator: Disable/Enable dumps using hiera [puppet] - 
10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [09:09:13] (03PS5) 10Paladox: phabricator: Change mail alias only on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/595488 [09:09:17] (03PS6) 10Dzahn: phabricator: Change mail alias only on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/595488 (owner: 10Paladox) [09:10:49] (03PS2) 10RhinosF1: Add *.deutsche-digitale-bibliothek.de to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595881 (https://phabricator.wikimedia.org/T252296) [09:14:21] 10Operations, 10ops-eqiad, 10serviceops: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10Dzahn) @wiki_willy Yea, we agree we can just decom the server at this point. [09:18:11] (03PS1) 10RhinosF1: Localisations for ti[wikipedia|wikitionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 [09:18:12] 10Operations, 10ops-codfw, 10DBA: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10Marostegui) This host is under warranty from what I can see, so maybe we should get a new memory DIMM from HP? That is what we did when it happened at T225378 [09:19:16] (03CR) 10Gilles: [C: 03+2] Allow caching 404 errors [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595880 (https://phabricator.wikimedia.org/T252509) (owner: 10Ema) [09:19:21] (03CR) 10Gilles: [V: 03+2 C: 03+2] Allow caching 404 errors [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595880 (https://phabricator.wikimedia.org/T252509) (owner: 10Ema) [09:19:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, I don't think we'll be regressing anything in production. 
Note that role::logstash::puppetreports (runs in WMCS only, I'm no" [puppet] - 10https://gerrit.wikimedia.org/r/595215 (owner: 10EBernhardson) [09:20:41] (03CR) 10Muehlenhoff: [C: 03+2] Remove jessie support from squid classes [puppet] - 10https://gerrit.wikimedia.org/r/595477 (owner: 10Muehlenhoff) [09:22:21] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: add thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/595491 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:22:39] (03PS1) 10Dzahn: site: decom mw1280 [puppet] - 10https://gerrit.wikimedia.org/r/595885 (https://phabricator.wikimedia.org/T251077) [09:22:48] (03PS2) 10Filippo Giunchedi: conftool-data: add thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/595491 (https://phabricator.wikimedia.org/T233956) [09:23:07] (03PS1) 10Ema: ATS: cache thumbnail 404s despite of CC [puppet] - 10https://gerrit.wikimedia.org/r/595886 (https://phabricator.wikimedia.org/T252509) [09:23:15] (03CR) 10Filippo Giunchedi: [C: 03+2] "Signed off by traffic yesterday on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/595491 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:24:43] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: allocate thanos-query.svc addresses [dns] - 10https://gerrit.wikimedia.org/r/595489 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:24:47] (03PS2) 10Filippo Giunchedi: wmnet: allocate thanos-query.svc addresses [dns] - 10https://gerrit.wikimedia.org/r/595489 (https://phabricator.wikimedia.org/T233956) [09:25:29] (03PS1) 10Gilles: Version bump [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595888 [09:25:31] (03PS2) 10RhinosF1: Localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 [09:25:43] (03CR) 10Gilles: [V: 03+2 C: 03+2] Version bump [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595888 (owner: 10Gilles) [09:26:21] (03CR) 
10jerkins-bot: [V: 04-1] Localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 (owner: 10RhinosF1) [09:26:53] (03PS1) 10Muehlenhoff: Remove more squid3 compat code [puppet] - 10https://gerrit.wikimedia.org/r/595889 [09:27:21] (03PS15) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [09:28:19] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 72.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [09:29:42] !log filippo@cumin1001 conftool action : set/pooled=yes:weight=100; selector: cluster=thanos [09:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:51] (03CR) 10RhinosF1: "10:26:00 tiwikipedia is referenced, but it isn't either a wiki or a dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 (owner: 10RhinosF1) [09:31:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [09:31:29] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [09:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [09:31:56] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [09:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [09:34:21] !log 
dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [09:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:07] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: update to have no embeded webserver [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594688 (https://phabricator.wikimedia.org/T233950) (owner: 10Jbond) [09:38:26] i don't know what i am doing wrong with the decom cookbook. i use it just like i did before but it says "No hosts provided" though i clearly give it a hostname [09:38:34] (03CR) 10Muehlenhoff: [C: 03+2] Switch prod IDPs to external Tomcat [puppet] - 10https://gerrit.wikimedia.org/r/595862 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [09:41:12] (03PS1) 10Gilles: Upgrade to 2.7 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595891 (https://phabricator.wikimedia.org/T252509) [09:41:34] mutante: mw1253.eqiad.wmnet is not in puppetdb [09:42:09] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:42:11] volans: i am trying mw1280 [09:42:39] is that because it's status "failed" in netbox? [09:43:00] if it's down since more than 2 weeks has been auto-removed from puppetdb [09:43:24] oh, even when it's still in site.pp? yea, that would be the case for this one [09:43:33] what would be the right way to remove it? [09:44:31] let me make a patch to the cookbook [09:44:39] cool, thanks! 
[09:45:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:45:42] the ticket for it was created 15 days ago. off by one day :P [09:51:47] (03PS2) 10Giuseppe Lavagetto: purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) [09:52:14] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Joe) >>! In T133821#6118058, @aaron wrote: >>>! In T133821#6092867, @Joe wrote: >> At a later time, we could think of changing the logic, and make purges avoid ra... [09:52:51] (03CR) 10jerkins-bot: [V: 04-1] purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [09:55:13] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:55:52] (03PS2) 10Filippo Giunchedi: hieradata: add thanos-query to service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/595493 (https://phabricator.wikimedia.org/T233956) [09:55:54] (03PS2) 10Filippo Giunchedi: thanos: add lvs addresses to frontend [puppet] - 10https://gerrit.wikimedia.org/r/595494 (https://phabricator.wikimedia.org/T233956) [09:58:50] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) @chasemp Did you want peek1001 or should we rather use something a bit more generic like sectool1001? Will it maybe hos... 
[10:00:20] (03PS3) 10Giuseppe Lavagetto: purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) [10:03:28] (03PS1) 10Dzahn: introduce sectools1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/595892 (https://phabricator.wikimedia.org/T252382) [10:03:43] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10MoritzMuehlenhoff) Why does this need a complete VM, though? If this simply sends some notifications triggered by cron jobs, si... [10:04:48] !log update compiler facts [10:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:42] (03CR) 10Muehlenhoff: [C: 03+2] Fix name for httpd::site and Icinga defs when using the staging flag [puppet] - 10https://gerrit.wikimedia.org/r/595550 (owner: 10Muehlenhoff) [10:09:19] (03PS2) 10Dzahn: introduce sectools1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/595892 [10:11:06] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/22472/" [puppet] - 10https://gerrit.wikimedia.org/r/595493 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:11:10] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/22472/" [puppet] - 10https://gerrit.wikimedia.org/r/595494 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:13:06] 10Operations, 10Patch-For-Review: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10Dzahn) >>! In T252382#6124358, @jcrespo wrote: > I tested it on backup1002 and this worked well. This can be closed Thanks! > - but I wonder if we should have a working grou... 
[10:13:16] (03PS1) 10Muehlenhoff: Enable staging IDP site for graphite [puppet] - 10https://gerrit.wikimedia.org/r/595895 [10:13:27] !log thumbor2001: upgrade python-thumbor-wikimedia to 2.6-1+deb10u1 [10:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:14] (03CR) 10Muehlenhoff: jenkins: simplify java setup, delete common class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595866 (owner: 10Dzahn) [10:17:39] (03CR) 10Volans: "Couple of questions inline. I would be also ok to fully remove the deprecated directories already, unless they are needed for some specifi" (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/595717 (owner: 10CRusnov) [10:19:43] !log repool thumbor2001 with upgraded python-thumbor-wikimedia [10:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:55] 10Operations, 10Performance-Team, 10Thumbor, 10Traffic, 10Patch-For-Review: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 (10ema) [10:24:25] (03PS1) 10Volans: gitignore: add build directory [cookbooks] - 10https://gerrit.wikimedia.org/r/595898 [10:24:27] 10Operations, 10Performance-Team, 10Thumbor: cwebp chokes on YCCK JPGs - https://phabricator.wikimedia.org/T226707 (10ema) 05Stalled→03Open [10:24:29] (03PS1) 10Volans: sre.hosts.decommission: support failed hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/595899 [10:24:30] mutante: ^^^ [10:25:27] (03CR) 10Muehlenhoff: [C: 03+2] Enable staging IDP site for graphite [puppet] - 10https://gerrit.wikimedia.org/r/595895 (owner: 10Muehlenhoff) [10:30:07] !log rolling thumbor upgrade to 2.6-1+deb10u1 T226707 [10:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:11] T226707: cwebp chokes on YCCK JPGs - https://phabricator.wikimedia.org/T226707 [10:30:17] !log upgrade trafficserver to version 8.0.7-1wm5 on cp4032 - T249335 [10:30:19] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:20] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [10:31:01] (03CR) 10Volans: [C: 03+2] gitignore: add build directory [cookbooks] - 10https://gerrit.wikimedia.org/r/595898 (owner: 10Volans) [10:31:59] (03CR) 10Volans: [C: 03+2] icinga: fix passive Icinga meta-monitoring for VO [puppet] - 10https://gerrit.wikimedia.org/r/595515 (https://phabricator.wikimedia.org/T252401) (owner: 10Volans) [10:33:13] (03Merged) 10jenkins-bot: gitignore: add build directory [cookbooks] - 10https://gerrit.wikimedia.org/r/595898 (owner: 10Volans) [10:34:53] (03PS1) 10JMeybohm: zotero: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/595900 (https://phabricator.wikimedia.org/T235411) [10:37:09] 10Operations, 10observability: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Volans) [10:37:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/595899 (owner: 10Volans) [10:38:28] (03PS2) 10Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) [10:39:29] 10Operations, 10observability: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Volans) Updated the description with the latest occurrences, ping: - @Joe @Gehel for the search ones - @herron @godo for the logstash one - @Ottomata @elukey for the eventgate one [10:40:24] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: support failed hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/595899 (owner: 10Volans) [10:42:12] (03Merged) 10jenkins-bot: sre.hosts.decommission: support failed hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/595899 (owner: 10Volans) [10:43:52] !log reimaging pc2010 to buster T252182 [10:43:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:55] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [10:44:27] mutante: cookbook updated on the cumin hosts ready for a beta tester ;) [10:46:10] (03PS1) 10Kormat: install_server: Allow reimage of pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/595901 (https://phabricator.wikimedia.org/T252182) [10:47:17] (03CR) 10Marostegui: [C: 03+1] install_server: Allow reimage of pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/595901 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [10:47:36] (03CR) 10Kormat: [C: 03+2] install_server: Allow reimage of pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/595901 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [10:47:58] (03PS4) 10Giuseppe Lavagetto: purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) [10:48:15] 10Operations, 10Performance-Team, 10Thumbor: cwebp chokes on YCCK JPGs - https://phabricator.wikimedia.org/T226707 (10ema) If I understand the issue correctly, it looks fixed to me now. The following is now returned as CT:image/jpeg: ` 10:45:56 ema@cp1076.eqiad.wmnet:~ $ curl -i https://swift.discovery.wmne... [10:52:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [10:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [10:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:51] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1280.eqiad.wmnet` - mw1280.eqiad.wmnet (**FAIL**) -... 
[10:53:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [10:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:37] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [10:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:39] mutante: two different hosts? [10:54:41] volans: it works! i typed "done" 2 times and it continued. thanks for the quick response [10:54:44] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1280.eqiad.wmnet` - mw1280.eqiad.wmnet (**FAIL**) -... [10:54:58] volans: no, i just repeated it with actually supplying the mgmt pass [10:55:23] it seemed to have worked the first time already [10:55:24] https://phabricator.wikimedia.org/T251077#6128811 [10:55:24] !log upgrade trafficserver to version 8.0.7-1wm5 on cp5011 - T249335 [10:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:28] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [10:55:57] volans: just wanted to make sure because i failed to provide it out of habit for VMs [10:57:03] it doesn't ask it anymore for VMs ;) [10:57:22] but if the power off fails it should complain [10:57:25] that's why I'm asking [10:57:26] cool:) it also changed state in netbox, looks good [10:57:42] if you did not provide one in the first run means the check that power off works is not working properly [10:57:43] volans: Powered off [10:58:27] PROBLEM - staging-cas-graphite.wikimedia.org requires authentication on graphite2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:58:47] volans: it said first 
"unable to connect to the host" (because it's broken) but then also "Powered off" [10:58:57] PROBLEM - MariaDB Slave IO: pc1 on pc2010 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [10:59:18] (03PS5) 10Giuseppe Lavagetto: purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) [10:59:21] that's the cookbook [10:59:24] I'll investigate [10:59:48] yeah, didn't fails but neither complained [10:59:48] 2020-05-12 10:53:18,718 dzahn 189982 [DEBUG ipmi.py:77 in command] [10:59:59] while the second time [10:59:59] 2020-05-12 10:54:12,614 dzahn 190061 [DEBUG ipmi.py:77 in command] Chassis Power Control: Down/Off [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:06] that's the output of the IPMI command [11:00:19] I have a patch if someone wants to [11:00:28] yes, that makes sense i guess. it cant connect to the host but it can connect to DRAC [11:00:40] and yes, first time i did not provide the pass [11:00:45] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/595881/ [11:00:50] ack, thanks, I'll send a patch later to improve it [11:00:55] thanks as well! 
[11:01:00] (03CR) 10Hnowlan: [C: 03+2] Change-prop: add produce metric mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/595637 (owner: 10Ppchelko) [11:01:18] (03Merged) 10jenkins-bot: Change-prop: add produce metric mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/595637 (owner: 10Ppchelko) [11:02:20] (03CR) 10Dzahn: [C: 03+2] "has been decom'ed with cookbook and was already inactive in confctl because it's broken" [puppet] - 10https://gerrit.wikimedia.org/r/595885 (https://phabricator.wikimedia.org/T251077) (owner: 10Dzahn) [11:02:51] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern - https://phabricator.wikimedia.org/T252129 (10Daniram3) [6] Here is my SSH public key for production access: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMzKRxyFfYw/A8UIQAm6o6hTdVOKCjgdh+qsXlzOGBT4 daniele@MacBook-Pro-di-D... [11:02:55] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/22475/ as-is, it just creates the /etc/purged directory" [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [11:04:15] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10Dzahn) @wiki_willy @Jclark-ctr We decom'ed mw1280 on our end and you can remove it any time. [11:05:31] PROBLEM - staging-cas-graphite.wikimedia.org requires authentication on graphite1004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:05:46] (03PS1) 10Giuseppe Lavagetto: Adding the fake certificates for purged [labs/private] - 10https://gerrit.wikimedia.org/r/595904 [11:05:47] ^ moritzm is that related to your work? 
[11:09:09] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [11:10:27] (03PS4) 10Dzahn: jenkins: simplify java setup, delete common class [puppet] - 10https://gerrit.wikimedia.org/r/595866 [11:11:01] (03CR) 10Dzahn: jenkins: simplify java setup, delete common class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595866 (owner: 10Dzahn) [11:11:10] marostegui: ack, it's harmless, but looking into it [11:11:21] thanks [11:11:30] (03CR) 10Dzahn: [C: 03+2] Remove more squid3 compat code [puppet] - 10https://gerrit.wikimedia.org/r/595889 (owner: 10Muehlenhoff) [11:17:35] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:04] (03CR) 10Muehlenhoff: jenkins: simplify java setup, delete common class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595866 (owner: 10Dzahn) [11:19:02] 10Operations, 10Performance-Team, 10Thumbor: cwebp chokes on YCCK JPGs - https://phabricator.wikimedia.org/T226707 (10Gilles) 05Open→03Resolved YCCK images requested as webp now fall back correctly to jpeg [11:20:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:48] (03PS2) 10Giuseppe Lavagetto: Adding the fake certificates for purged [labs/private] - 10https://gerrit.wikimedia.org/r/595904 [11:25:08] (03PS1) 10Giuseppe Lavagetto: cache::text: enable reading purges from kafka on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) [11:29:08] (03PS1) 10Hnowlan: changeprop: new 
package [deployment-charts] - 10https://gerrit.wikimedia.org/r/595907 [11:29:50] (03CR) 10Dzahn: [C: 03+2] "yep, tested the first one and it replaces whitespace in URLs" [puppet] - 10https://gerrit.wikimedia.org/r/595513 (owner: 10Aklapper) [11:30:11] (03PS2) 10Dzahn: phabricator weekly changes email: Fix links to project pages [puppet] - 10https://gerrit.wikimedia.org/r/595513 (owner: 10Aklapper) [11:35:30] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:30] ^ fixing [11:36:36] (03PS1) 10Muehlenhoff: Don't add an Icinga service for the staging IDP vhosts [puppet] - 10https://gerrit.wikimedia.org/r/595908 [11:36:42] (03PS1) 10Privacybatm: transfer.py: Move MariaDB and Firewall logic to its new handler modules [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) [11:36:54] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:27] (03CR) 10Dzahn: [C: 03+2] "works. the results are 43385 and 1230, btw." 
[puppet] - 10https://gerrit.wikimedia.org/r/595514 (owner: 10Aklapper) [11:37:43] (03PS2) 10Dzahn: Phabricator monthly email: Explicitly list number of stalled tasks [puppet] - 10https://gerrit.wikimedia.org/r/595514 (owner: 10Aklapper) [11:37:51] (03CR) 10jerkins-bot: [V: 04-1] Don't add an Icinga service for the staging IDP vhosts [puppet] - 10https://gerrit.wikimedia.org/r/595908 (owner: 10Muehlenhoff) [11:39:05] (03PS2) 10Muehlenhoff: Don't add an Icinga service for the staging IDP vhosts [puppet] - 10https://gerrit.wikimedia.org/r/595908 [11:40:54] (03CR) 10Hnowlan: Preserve datetime field of the purge events (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) (owner: 10Ppchelko) [11:41:17] (03CR) 10Hnowlan: [C: 03+2] changeprop: new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/595907 (owner: 10Hnowlan) [11:41:35] (03Merged) 10jenkins-bot: changeprop: new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/595907 (owner: 10Hnowlan) [11:42:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:44:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:45:18] (03PS2) 10Privacybatm: transfer.py: Move MariaDB and Firewall logic to its new handler modules [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) [11:48:12] (03CR) 10Dzahn: [C: 04-1] "it's not installed in prod so not really used. 
let's not install it unless we use it. I hope you can just click ignore on the warning?" [puppet] - 10https://gerrit.wikimedia.org/r/594157 (owner: 10Paladox) [11:49:04] (03PS2) 10Dzahn: role::puppetmaster::standalone: explicitly include httpd [puppet] - 10https://gerrit.wikimedia.org/r/595650 (owner: 10Andrew Bogott) [11:49:59] (03CR) 10Dzahn: [C: 04-1] "thanks, but it's a duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/595701 meanwhile" [puppet] - 10https://gerrit.wikimedia.org/r/595650 (owner: 10Andrew Bogott) [11:50:49] (03CR) 10Dzahn: [C: 04-1] "per Subbu's comments it still needs to be amended" [puppet] - 10https://gerrit.wikimedia.org/r/577656 (owner: 10C. Scott Ananian) [11:52:52] (03CR) 10Dzahn: [C: 04-1] "please continue the discussion from https://phabricator.wikimedia.org/T244162#6021780 on https://phabricator.wikimedia.org/T215360 I am ju" [puppet] - 10https://gerrit.wikimedia.org/r/569627 (https://phabricator.wikimedia.org/T215360) (owner: 10Zoranzoki21) [11:53:48] (03CR) 10Dzahn: "i'm afraid this is too large of a patch at once for people to review" [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [11:55:21] (03CR) 10Dzahn: [C: 04-1] "@paladox I think we don't need this anymore since you already fixed and deployed phab in cloud already." [puppet] - 10https://gerrit.wikimedia.org/r/565712 (owner: 10Paladox) [11:56:32] (03CR) 10Hnowlan: [C: 04-1] "This will need a new package to release this change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) (owner: 10Ppchelko) [11:56:45] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[11:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:23] (03CR) 10Dzahn: [C: 03+1] "I grouped these pending patches under https://gerrit.wikimedia.org/r/q/topic:%22gerrit-paladox%22+(status:open%20OR%20status:merged) Could" [puppet] - 10https://gerrit.wikimedia.org/r/556270 (owner: 10Paladox) [11:58:15] (03CR) 10Dzahn: "@jbond Do you know about this?" [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: 10Jcrespo) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T1200) [12:03:46] (03CR) 10Dzahn: "this was originally uploaded in 2017. let me try a manual rebase and see what's left." [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [12:03:56] (03PS5) 10Dzahn: profile::mediawiki::jobrunner: restrict firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [12:06:28] (03CR) 10Jbond: [C: 03+1] "Servermon is dead and nothing else in puppetdb uses MySQL so i think this is fine. added moritz in case im missing some historic backgrou" [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: 10Jcrespo) [12:07:12] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/22479/mw1337.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [12:07:27] 10Operations, 10Traffic: Investigate trafficserver-tls crash on cp3064 - https://phabricator.wikimedia.org/T240183 (10Marostegui) @ema @Vgutierrez any outcome here? Any point on keeping this track opened? 
[12:08:05] !log restart blazegraph + updater on wdqs2002 - JVM upgrade [12:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:13] !log Cutting branch 1.35.0-wmf.32 # T249964 [12:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:16] T249964: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 [12:08:29] 10Operations, 10SRE-tools: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10Marostegui) [12:12:20] (03CR) 10ArielGlenn: [C: 03+2] in page content fixup script, check for truncation, move into place if good [dumps] - 10https://gerrit.wikimedia.org/r/595518 (owner: 10ArielGlenn) [12:12:40] (03CR) 10jerkins-bot: [V: 04-1] in page content fixup script, check for truncation, move into place if good [dumps] - 10https://gerrit.wikimedia.org/r/595518 (owner: 10ArielGlenn) [12:13:16] (03CR) 10Dzahn: [C: 03+1] "This still looks good to me (now). port 9005 is open and just for health checks. The only thing using it i see is an Icinga check_command " [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [12:13:34] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] in page content fixup script, check for truncation, move into place if good [dumps] - 10https://gerrit.wikimedia.org/r/595518 (owner: 10ArielGlenn) [12:14:16] (03CR) 10Muehlenhoff: [C: 03+1] "Agreed, this should only be used by the (now gone) servermon" [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: 10Jcrespo) [12:14:48] (03CR) 10Dzahn: "since torrelay is gone ( ;/ ) i guess this is not going to be used anymore?" 
[puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [12:14:51] (03PS1) 10ArielGlenn: tiny flake8 fix [dumps] - 10https://gerrit.wikimedia.org/r/595916 [12:18:23] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [12:18:34] 10Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10Dzahn) [12:19:18] 10Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10Dzahn) [12:19:22] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) [12:19:39] 10Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10Dzahn) p:05Triage→03Medium [12:19:44] (03CR) 10ArielGlenn: [C: 03+2] tiny flake8 fix [dumps] - 10https://gerrit.wikimedia.org/r/595916 (owner: 10ArielGlenn) [12:19:49] 10Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10Dzahn) a:03Dzahn [12:20:15] (03Merged) 10jenkins-bot: tiny flake8 fix [dumps] - 10https://gerrit.wikimedia.org/r/595916 (owner: 10ArielGlenn) [12:21:35] (03CR) 10Dzahn: "Moritz, given the new information on blockers, is that (still) a veto from you?" 
[puppet] - 10https://gerrit.wikimedia.org/r/593166 (https://phabricator.wikimedia.org/T251349) (owner: 10Dzahn) [12:23:49] 10Operations, 10SRE-swift-storage: xfs_db blocked / timeout on ms-be2023 - https://phabricator.wikimedia.org/T185298 (10Marostegui) @fgiunchedi is this task still valid? [12:23:51] 10Operations, 10DBA: wmf-auto-reinstall fails on hosts that run pt-heartbeat - https://phabricator.wikimedia.org/T252528 (10Kormat) [12:26:27] 10Operations, 10DBA: wmf-auto-reinstall fails on hosts that run pt-heartbeat - https://phabricator.wikimedia.org/T252528 (10Marostegui) p:05Triage→03Medium This is "expected" as the host doesn't have MySQL up and running. This is pretty much the last step of the script, so even if the installation reports... [12:29:31] 10Operations: mw1230 sdb "Raw_Read_Error_Rate" SMART - https://phabricator.wikimedia.org/T194036 (10Marostegui) 05Open→03Invalid This host no longer exists - resolving this. [12:31:48] (03CR) 10Marostegui: [C: 03+1] "This requires applying the grants manually on the DB, let me know when do you want me to do it" [puppet] - 10https://gerrit.wikimedia.org/r/595207 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [12:31:55] 10Operations, 10MediaWiki-Cache, 10Traffic, 10serviceops, and 4 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (10Ladsgroup) Amazing work. Thank you! 
[12:35:27] (03PS1) 10Vgutierrez: ATS: Increase max_connections_in and max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/595919 (https://phabricator.wikimedia.org/T249335) [12:37:34] !log installing iputils update from Buster point release [12:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:29] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Adding the fake certificates for purged [labs/private] - 10https://gerrit.wikimedia.org/r/595904 (owner: 10Giuseppe Lavagetto) [12:42:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This is going to be fun, let's see how it goes. LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595900 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [12:42:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] New upstream version 2.16.7 [debs/helm] - 10https://gerrit.wikimedia.org/r/595591 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [12:48:13] (03PS6) 10Giuseppe Lavagetto: purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) [12:48:15] (03PS2) 10Giuseppe Lavagetto: cache::text: enable reading purges from kafka on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) [12:51:14] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10Dzahn) [12:52:06] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10Dzahn) Added Ryan to pwstore in the ops group after importing his key and checking it had a signature from Keith. 
[12:53:36] 10Operations, 10Privacy Engineering, 10Research, 10Traffic, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10Dzahn) Thanks @leila ! I would be happy to merge my patches but i don't have +2 on that repo. There is... [12:54:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/22481/cp3050.esams.wmnet/index.html seems to do the right thing. On hold for review and w" [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:00:05] hashar and twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T1300). [13:03:02] (03Abandoned) 10Faidon Liambotis: tor: add an additional relay instance [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [13:04:49] (03PS1) 10Kormat: install_server: Allow reimage of pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/595922 (https://phabricator.wikimedia.org/T252182) [13:05:43] !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.28 (duration: 23m 47s) [13:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:52] (03CR) 10Muehlenhoff: [C: 03+2] Don't add an Icinga service for the staging IDP vhosts [puppet] - 10https://gerrit.wikimedia.org/r/595908 (owner: 10Muehlenhoff) [13:06:39] (03CR) 10Marostegui: [C: 03+1] install_server: Allow reimage of pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/595922 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:06:55] (03CR) 10Kormat: [C: 03+2] install_server: Allow reimage of pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/595922 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:07:03] (03PS1) 10Hashar: testwikis wikis to 1.35.0-wmf.32 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/595923 [13:07:05] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595923 (owner: 10Hashar) [13:08:06] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595923 (owner: 10Hashar) [13:08:38] !log hashar@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.32 [13:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:16] _joe_: is it safe to puppet-merge your BATMAN changes? [13:12:27] <_joe_> kormat: sure [13:12:47] 🦇 [13:12:53] moritzm: is it safe to puppet-merge your icinga/idp change? [13:13:10] ack, please do, got distracted [13:13:54] moritzm: np, done [13:15:20] (03CR) 10Ema: [C: 03+1] ATS: Increase max_connections_in and max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/595919 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:16:35] 10Operations, 10SRE-OnFire, 10Goal: SRE firefighting improvements - 2019-20 Q1 Goal - https://phabricator.wikimedia.org/T229782 (10CDanis) [13:17:55] (03CR) 10Ema: [C: 03+2] Upgrade to 2.7 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595891 (https://phabricator.wikimedia.org/T252509) (owner: 10Gilles) [13:18:34] 10Operations, 10SRE-tools: wmf-auto-reimage-host: failed to resolve mgmt FQDN while renaming host - https://phabricator.wikimedia.org/T214314 (10Volans) [13:22:00] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Dzahn) @Legoktm Where can i find the dump file please? 
[13:24:21] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [13:24:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [13:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:35] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [13:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:15] (03CR) 10Vgutierrez: [C: 03+2] ATS: Increase max_connections_in and max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/595919 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:26:53] (03PS1) 10Dzahn: site: add people1002 [puppet] - 10https://gerrit.wikimedia.org/r/595927 (https://phabricator.wikimedia.org/T247649) [13:29:07] (03PS1) 10Kormat: install_server: Switch all remaining pc* hosts to buster. [puppet] - 10https://gerrit.wikimedia.org/r/595928 (https://phabricator.wikimedia.org/T252182) [13:30:32] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:37] (03CR) 10Marostegui: [C: 03+1] install_server: Switch all remaining pc* hosts to buster. [puppet] - 10https://gerrit.wikimedia.org/r/595928 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:31:19] (03CR) 10Kormat: [C: 03+2] install_server: Switch all remaining pc* hosts to buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/595928 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:31:29] !log rebooting deneb for kernel update [13:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:39] !log rolling upgrade of ATS to version 8.0.7-1wm5 - T249335 [13:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:46] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [13:34:56] (03CR) 10Ema: purged: add support for kafka (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:35:40] (03CR) 10Ema: [C: 03+1] Add integration tests using docker-compose [software/purged] - 10https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:36:04] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:25] !log rebooting netflow* hosts for kernel update [13:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:33] !log reimaging pc2007 to buster T252182 [13:36:34] (03CR) 10Ema: [C: 03+1] "Please add a line to d/changelog too, otherwise looks good!" 
[software/purged] - 10https://gerrit.wikimedia.org/r/594953 (owner: 10Giuseppe Lavagetto) [13:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:35] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [13:36:46] (03CR) 10Ema: [C: 03+1] Add the ability to consume from kafka [software/purged] - 10https://gerrit.wikimedia.org/r/594147 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:39:03] PROBLEM - MariaDB Slave SQL: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:39:15] (03CR) 10Ema: [C: 03+1] "Please merge with puppet disabled on all 'A:cp' hosts. Depool a single codfw text node (eg: cp2027) and try the patch there. Do the same o" [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [13:39:19] PROBLEM - MariaDB read only pc1 on pc2007 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:40:26] kormat: ^ [13:41:20] (03PS1) 10Filippo Giunchedi: swift: move swift::params::swift_cluster to profile::swift::cluster [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) [13:42:22] !log disable puppet on all CP hosts to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/583342 [13:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:32] (03CR) 10Jbond: [C: 03+2] varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [13:44:18] marostegui: what did i do wrong? 
[13:44:22] (03PS1) 10Volans: wmf-auto-reimage: fix autodetected rename MGMT [puppet] - 10https://gerrit.wikimedia.org/r/595931 (https://phabricator.wikimedia.org/T214314) [13:44:47] kormat: I guess notifications didn't get disabled and/or no downtime was applied? [13:45:14] Maybe the bug we saw the other day with just some notifications getting disabled but not all of them? [13:45:21] I haven't checked the host itself yet [13:45:23] 10Operations, 10SRE-tools, 10Patch-For-Review: wmf-auto-reimage-host: failed to resolve mgmt FQDN while renaming host - https://phabricator.wikimedia.org/T214314 (10Volans) I managed to find the related bug in the code also without the logs. It should be fixed once the above patch gets reviewed/merged [13:45:47] (03PS1) 10Jbond: Revert "varnish: update varnish config to use the abuse_networks global" [puppet] - 10https://gerrit.wikimedia.org/r/595933 [13:45:56] https://cas-icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=pc2007 says they're all downtimed now at least [13:46:02] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "varnish: update varnish config to use the abuse_networks global" [puppet] - 10https://gerrit.wikimedia.org/r/595933 (owner: 10Jbond) [13:46:28] May 12 13:24:24 icinga1001 puppet-agent[214607]: Applying configuration version '(a43ce5a8fb) Kormat - install_server: Allow reimage of pc2007' [13:46:54] that's 15mins before the alert fired [13:47:23] (03PS1) 10Ottomata: Add notes_link to some eventgate icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/595934 (https://phabricator.wikimedia.org/T211692) [13:48:03] (03PS1) 10Jbond: Revert "Revert "varnish: update varnish config to use the abuse_networks global"" [puppet] - 10https://gerrit.wikimedia.org/r/595935 [13:48:06] so, yeah. 
smells like an icinga bug [13:50:00] (03PS2) 10Ottomata: Add notes_link and use + instead of space in some eventgate icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/595934 (https://phabricator.wikimedia.org/T211692) [13:50:55] !log thumbor2001: upgrade python-thumbor-wikimedia to 2.7-1+deb10u1 T252509 T219569 T236240 [13:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:59] T236240: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 [13:51:00] T219569: libvips thumbnail generation fails for TIFF files with invalid ICC profiles - https://phabricator.wikimedia.org/T219569 [13:51:00] T252509: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 [13:51:04] (03CR) 10Volans: Add notes_link and use + instead of space in some eventgate icinga alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595934 (https://phabricator.wikimedia.org/T211692) (owner: 10Ottomata) [13:52:24] (03PS3) 10Ottomata: Add notes_link and use + instead of space in some eventgate icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/595934 (https://phabricator.wikimedia.org/T211692) [13:52:38] (03CR) 10Ottomata: Add notes_link and use + instead of space in some eventgate icinga alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595934 (https://phabricator.wikimedia.org/T211692) (owner: 10Ottomata) [13:53:04] (03PS2) 10Filippo Giunchedi: swift: move swift::params::swift_cluster to profile::swift::cluster [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) [13:54:21] !log thumbor2001: pool thumbor 2.7-1+deb10u1 for prod traffic T252509 T219569 T236240 [13:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:42] (03CR) 10Ottomata: [C: 03+2] Add notes_link and use + instead of space in some 
eventgate icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/595934 (https://phabricator.wikimedia.org/T211692) (owner: 10Ottomata) [13:55:27] (03CR) 10Filippo Giunchedi: "PCC appears to be happy https://puppet-compiler.wmflabs.org/compiler1003/22485/" [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [13:56:33] (03CR) 10Ottomata: [C: 03+2] "Nice! I was wondering why I was getting that." [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/595859 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:56:43] (03Abandoned) 10Andrew Bogott: role::puppetmaster::standalone: explicitly include httpd [puppet] - 10https://gerrit.wikimedia.org/r/595650 (owner: 10Andrew Bogott) [13:58:49] (03PS1) 10Ottomata: Remove notes_links from monitoring::alerts::kafka_topic_throughput [puppet] - 10https://gerrit.wikimedia.org/r/595936 [13:59:56] hashar, how is the train going? [14:00:06] (03CR) 10Ottomata: [C: 03+2] Remove notes_links from monitoring::alerts::kafka_topic_throughput [puppet] - 10https://gerrit.wikimedia.org/r/595936 (owner: 10Ottomata) [14:00:47] !log thumbor2001: depool due to minor bug in 2.7-1+deb10u1 T252509 T219569 T236240 [14:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:53] T236240: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 [14:00:53] T219569: libvips thumbnail generation fails for TIFF files with invalid ICC profiles - https://phabricator.wikimedia.org/T219569 [14:00:53] T252509: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 [14:02:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:36] (03CR) 10CDanis: [C: 03+1] "LGTM assuming no 
diffs" [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [14:05:00] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:32] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] hashar: twentyafterfour also curious about train status, i have a mw config change i want to sync [14:06:13] (03PS6) 10Hnowlan: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) [14:06:21] from brief scan of T249964 it looks like maybe train is stuck for now? 
[14:06:21] T249964: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 [14:06:34] ottomata: it is rsyncing the stuff [14:06:36] which takes ages and ages [14:06:38] ah ok [14:06:41] i will wait then [14:06:56] :( [14:06:58] (03CR) 10Andrew Bogott: [C: 03+2] Openstack: move API traffic to cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/595229 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [14:07:03] (03PS3) 10Andrew Bogott: Openstack: move API traffic to cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/595229 (https://phabricator.wikimedia.org/T252121) [14:07:13] (03CR) 10Hnowlan: changeprop: add cpjobqueue configuration switching (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:07:13] seems to be cpu bound [14:07:15] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: move all openstack API support to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/595227 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [14:07:22] ty! hashar can you ping me when it is safe to scap sync my change? [14:07:29] due to compressing every single file or something [14:07:31] sure [14:07:33] aye [14:07:41] 75 servers left [14:09:25] (03PS1) 10Gilles: Fix _write_results_to_client compatibility for both Thumbor 6 and 7 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595938 [14:10:20] 10Operations, 10observability, 10Patch-For-Review: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Ottomata) I dunno why this would happen, the alerts look pretty distinct. There was a space in the dashboard_link that showed up in the notes_url on line 1786... 
[14:10:57] (03CR) 10Jbond: [C: 03+2] Revert "Revert "varnish: update varnish config to use the abuse_networks global"" [puppet] - 10https://gerrit.wikimedia.org/r/595935 (owner: 10Jbond) [14:11:23] (03CR) 10Gilles: [V: 03+2 C: 03+2] Fix _write_results_to_client compatibility for both Thumbor 6 and 7 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595938 (owner: 10Gilles) [14:11:46] (03CR) 10Ema: [C: 03+1] thanos: add lvs addresses to frontend [puppet] - 10https://gerrit.wikimedia.org/r/595494 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:12:08] (03PS1) 10Gilles: Version bump [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595939 [14:12:16] (03CR) 10Gilles: [V: 03+2 C: 03+2] Version bump [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/595939 (owner: 10Gilles) [14:13:07] (03CR) 10Ema: [C: 03+1] hieradata: add thanos-query to service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/595493 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:13:11] (03PS1) 10Gilles: Upgrade to 2.8 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595940 [14:13:29] (03PS2) 10Ottomata: systemd::timer::job - add ability to syslog match based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) [14:14:31] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job - add ability to syslog match based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [14:15:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Thanks! this is a pretty nice start! I like the general idea and think this is more or less the way to go. I have a couple of comments." 
(034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595177 (owner: 10Apakhomov) [14:16:23] (03PS2) 10Gilles: Upgrade to 2.8 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595940 [14:16:32] (03PS3) 10Ottomata: systemd::timer::job - add ability to syslog match based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) [14:16:58] (03CR) 10Elukey: Add bash shabang to all bin scripts (031 comment) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/595859 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [14:17:03] (03CR) 10Volans: Add notes_link and use + instead of space in some eventgate icinga alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595934 (https://phabricator.wikimedia.org/T211692) (owner: 10Ottomata) [14:17:32] (03PS2) 10Elukey: Add bash shabang to all bin scripts [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/595859 (https://phabricator.wikimedia.org/T250161) [14:17:34] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job - add ability to syslog match based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [14:18:42] 10Operations, 10Traffic: Investigate trafficserver-tls crash on cp3064 - https://phabricator.wikimedia.org/T240183 (10ema) 05Open→03Resolved a:03ema >>! In T240183#6128998, @Marostegui wrote: > @ema @Vgutierrez any outcome here? Any point on keeping this track opened? Nope, thanks @Marostegui. We can r... [14:18:47] 10Operations, 10observability: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Volans) >>! In T211692#6129396, @Ottomata wrote: > I dunno why this would happen, the alerts look pretty distinct. There was a space in the dashboard_link that showed up in the not... 
[14:20:11] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add thanos-query to service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/595493 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:20:25] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add lvs addresses to frontend [puppet] - 10https://gerrit.wikimedia.org/r/595494 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:20:42] !log hashar@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.32 (duration: 72m 04s) [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:21] ottomata: finished ! [14:23:18] !log installing Java security updates on WDQS hosts [14:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:19] I added a warning that Tuesdays steps can take longer; should've done it when I conducted the train last :( [14:28:36] (03PS1) 10Kormat: install_server: Fix netboot.cfg entry for pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/595945 (https://phabricator.wikimedia.org/T252182) [14:29:33] ty! 
[14:29:46] (03CR) 10Ema: [C: 03+2] Upgrade to 2.8 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/595940 (owner: 10Gilles) [14:30:01] (03PS4) 10Ottomata: systemd::timer::job - add ability to syslog match based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) [14:30:07] (03CR) 10Ottomata: [C: 03+2] Configure wgEventLoggingSchemas overrides in beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:30:20] (03PS3) 10Filippo Giunchedi: swift: move swift::params::swift_cluster to profile::swift::cluster [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) [14:30:22] (03PS1) 10Filippo Giunchedi: hieradata: set thanos-query as lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/595946 (https://phabricator.wikimedia.org/T252537) [14:30:32] (03CR) 10Pablo Grass (WMDE): [C: 03+1] "No objection. 
In fact, I am surprised that this wasn't the case before as my LocalSettings has a commented-out version of the anchored Reg" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595544 (owner: 10Lucas Werkmeister (WMDE)) [14:33:11] !log thumbor2001: upgrade python-thumbor-wikimedia to 2.8-1+deb10u1 T252509 T219569 T236240 [14:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:17] T236240: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 [14:33:17] T219569: libvips thumbnail generation fails for TIFF files with invalid ICC profiles - https://phabricator.wikimedia.org/T219569 [14:33:17] T252509: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 [14:33:44] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - Test everywhere, SearchSatisfaction on testwiki only - T249261 (duration: 01m 06s) [14:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:48] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [14:34:22] !log thumbor2001: repool [14:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:55] (03CR) 10Lucas Werkmeister (WMDE): "Thanks, added to tomorrow’s EU SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/595544 (owner: 10Lucas Werkmeister (WMDE)) [14:35:38] (03PS1) 10Andrew Bogott: Rebuild cloudcontrol1003/1004 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/595949 (https://phabricator.wikimedia.org/T252121) [14:36:38] ok I think I got it cut [14:36:41] it is on testwiki [14:36:51] (03CR) 10Andrew Bogott: [C: 03+2] Rebuild cloudcontrol1003/1004 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/595949 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [14:37:09] and I can't do the group0 promotion right now [14:37:23] so well later [14:38:03] !log 1.35.0-wmf.32 is on test wikis. Will be pushed to group0 later today during the american window (19:00 - 21:00 UTC) # T249964 [14:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:07] T249964: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 [14:39:55] !log rebuilding cloudcontrol1003 and 1004 [14:39:56] !log rolling thumbor upgrade to 2.8-1+deb10u1 T252509 T219569 T236240 [14:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:02] T236240: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 [14:40:02] T219569: libvips thumbnail generation fails for TIFF files with invalid ICC profiles - https://phabricator.wikimedia.org/T219569 [14:40:02] T252509: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 [14:40:43] !log imported openjdk-8 u252 forward port for buster-wikimedia component/jdk8 [14:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the ability to consume from kafka [software/purged] - 
10https://gerrit.wikimedia.org/r/594147 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [14:42:37] (03CR) 10Marostegui: [C: 03+1] install_server: Fix netboot.cfg entry for pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/595945 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [14:42:39] (03PS4) 10Filippo Giunchedi: swift: move swift::params::swift_cluster to profile::swift::cluster [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) [14:42:41] (03PS2) 10Filippo Giunchedi: hieradata: set thanos-query as lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/595946 (https://phabricator.wikimedia.org/T252186) [14:42:54] 10Operations, 10netops: Routinator RSYNC errors - https://phabricator.wikimedia.org/T240817 (10ayounsi) 05Stalled→03Resolved Fix is now running in prod. Grafana alerts have been updated accordingly. [14:43:29] (03CR) 10Kormat: [C: 03+2] install_server: Fix netboot.cfg entry for pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/595945 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [14:46:54] (03CR) 10Giuseppe Lavagetto: "recheck" [software/purged] - 10https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [14:46:58] (03CR) 10Dzahn: [C: 03+2] site: add people1002 [puppet] - 10https://gerrit.wikimedia.org/r/595927 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [14:47:09] (03PS2) 10Dzahn: site: add people1002 [puppet] - 10https://gerrit.wikimedia.org/r/595927 (https://phabricator.wikimedia.org/T247649) [14:49:59] (03PS1) 10Dzahn: DHCP: add people1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/595951 (https://phabricator.wikimedia.org/T247649) [14:50:51] (03CR) 10Dzahn: [C: 03+2] DHCP: add people1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/595951 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [14:51:18] (03CR) 10Filippo Giunchedi: [C: 03+2] 
hieradata: set thanos-query as lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/595946 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:51:27] (03PS3) 10Filippo Giunchedi: hieradata: set thanos-query as lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/595946 (https://phabricator.wikimedia.org/T252186) [14:52:22] (03Abandoned) 10Ema: ATS: cache thumbnail 404s despite of CC [puppet] - 10https://gerrit.wikimedia.org/r/595886 (https://phabricator.wikimedia.org/T252509) (owner: 10Ema) [14:53:10] 10Operations, 10Performance-Team, 10Thumbor, 10Traffic, 10Patch-For-Review: thumbor: set Cache-Control ensuring cacheability on 404 responses - https://phabricator.wikimedia.org/T252509 (10Gilles) 05Open→03Resolved a:03Gilles Ratio of 404s to 200s on Thumbor are now back to their pre-2020-04-30 levels [14:55:54] (03PS1) 10Ottomata: Fix duplicate alert descriptions for eventgate_logging_external_errors [puppet] - 10https://gerrit.wikimedia.org/r/595952 (https://phabricator.wikimedia.org/T211692) [14:56:21] (03CR) 10Jdlrobson: Optimise all static PNGs losslessly (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [14:56:58] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) 05Open→03Resolved Fix confirmed on https://commons.wikimedia... [14:57:52] (03CR) 10EBernhardson: "This doesn't impact production, rather in WMCS only instances inside deployment-prep project can talk to kafka. 
Firewalls and security gro" [puppet] - 10https://gerrit.wikimedia.org/r/595215 (owner: 10EBernhardson) [14:58:40] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 75 connections established with conf2001.codfw.wmnet:2379 (min=76) https://wikitech.wikimedia.org/wiki/PyBal [14:59:02] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.53:10902]) https://wikitech.wikimedia.org/wiki/PyBal [14:59:34] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.53:10902]) https://wikitech.wikimedia.org/wiki/PyBal [14:59:40] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 55 connections established with conf2001.codfw.wmnet:2379 (min=56) https://wikitech.wikimedia.org/wiki/PyBal [14:59:43] that's likely me ^ needs pybal restarts afaik [15:00:00] it seems just one under the alert threshold for both [15:00:15] yeah I just added a service, checks out [15:00:40] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Aklapper) [15:01:05] ack [15:01:16] !log bounce pybal on lvs2010 and lvs2009 - T252186 [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:19] T252186: Deploy Thanos (Prometheus long-term storage) stateful components - https://phabricator.wikimedia.org/T252186 [15:02:08] !log upgrading contint2001 to openjdk-8 u252 [15:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:01] (03PS1) 10Ottomata: Allow eventgate-analytics-* to access schema.svc on port 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595953 [15:04:04] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 76 connections established with conf2001.codfw.wmnet:2379 (min=76) https://wikitech.wikimedia.org/wiki/PyBal [15:04:24] RECOVERY - PyBal IPVS diff 
check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:04:37] !log installing 4.19.118 Linux updates on Buster nodes (reboots happening later) [15:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:00] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:05:08] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 56 connections established with conf2001.codfw.wmnet:2379 (min=56) https://wikitech.wikimedia.org/wiki/PyBal [15:05:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] Allow eventgate-analytics-* to access schema.svc on port 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595953 (owner: 10Ottomata) [15:06:00] (03PS1) 10Filippo Giunchedi: hieradata: thanos-query to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/595955 (https://phabricator.wikimedia.org/T252186) [15:06:11] (03Merged) 10jenkins-bot: Allow eventgate-analytics-* to access schema.svc on port 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595953 (owner: 10Ottomata) [15:06:42] 10Operations, 10vm-requests: eqiad/codfw: 1 each VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn) 05Open→03Resolved Tried it again today and this time it worked (maybe because other VMs have been removed meanwhile). VM has been created (in eqiad) as people1002. 
[15:06:47] 10Operations, 10serviceops, 10Patch-For-Review: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (10Dzahn) [15:07:43] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: thanos-query to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/595955 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:07:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:52] (03CR) 10Ottomata: "I think should work and be a no-op if not specified to use equals." [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [15:08:14] (03CR) 10Ottomata: "I had thought about updating to the newer rsyslog conf syntax, but then got scared when I saw how many places this was used." [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [15:09:15] (03PS2) 10Ottomata: Fix duplicate alert descriptions for eventgate_logging_external_errors [puppet] - 10https://gerrit.wikimedia.org/r/595952 (https://phabricator.wikimedia.org/T211692) [15:09:45] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:09:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [15:12:13] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:06] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime [15:13:06] !log sukhe@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) [15:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:21] (03PS7) 10Ppchelko: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:14:59] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:11] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:21] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:36] (03CR) 10Ppchelko: changeprop: add cpjobqueue configuration switching (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:17:22] (03PS1) 10Dzahn: site: add peopleweb role to people1002 [puppet] - 10https://gerrit.wikimedia.org/r/595956 (https://phabricator.wikimedia.org/T247649) [15:18:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) @ayounsi did we have host names yet? 
[15:19:37] (03PS1) 10Dzahn: add IPv6 records for people1002 [dns] - 10https://gerrit.wikimedia.org/r/595957 (https://phabricator.wikimedia.org/T247649) [15:23:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10ayounsi) Yep, see diagram (minus the typo). `cloudsw-c8-eqiad` `cloudsw-d5-eqiad` [15:23:38] (03PS1) 10Dzahn: switch peopleweb service/discovery names to people1002 [dns] - 10https://gerrit.wikimedia.org/r/595959 (https://phabricator.wikimedia.org/T247649) [15:23:41] (03PS1) 10Filippo Giunchedi: hieradata: thanos-query to production [puppet] - 10https://gerrit.wikimedia.org/r/595958 (https://phabricator.wikimedia.org/T252186) [15:23:51] (03PS1) 10Marostegui: install_server: Allow reimage of dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/595960 (https://phabricator.wikimedia.org/T202367) [15:24:51] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: thanos-query to production [puppet] - 10https://gerrit.wikimedia.org/r/595958 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:25:04] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage of dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/595960 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [15:25:27] godog: ok to merge your change? [15:25:37] marostegui: yes please, thank you [15:25:42] ok, doing it! [15:25:52] (03CR) 10Jcrespo: "This is easier to guide- the general idea looks good, but I would change the directory structure to be less flat." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [15:26:55] (03PS1) 10Kormat: Revert "install_server: Allow reimage of pc2010" [puppet] - 10https://gerrit.wikimedia.org/r/595961 (https://phabricator.wikimedia.org/T252182) [15:27:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "install_server: Allow reimage of pc2010" [puppet] - 10https://gerrit.wikimedia.org/r/595961 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [15:28:16] PROBLEM - Host furud.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:08] (03PS2) 10Kormat: Revert "install_server: Allow reimage of pc2010" [puppet] - 10https://gerrit.wikimedia.org/r/595961 (https://phabricator.wikimedia.org/T252182) [15:29:50] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Allow reimage of pc2010" [puppet] - 10https://gerrit.wikimedia.org/r/595961 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [15:31:48] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) [15:32:23] (03PS1) 10Filippo Giunchedi: Add thanos-query discovery record [dns] - 10https://gerrit.wikimedia.org/r/595965 (https://phabricator.wikimedia.org/T252186) [15:33:05] 10Operations, 10ops-eqiad: Degraded RAID on kafka-jumbo1001 - https://phabricator.wikimedia.org/T251586 (10elukey) ` elukey@kafka-jumbo1001:~$ sudo megacli -LDPDInfo -aAll | grep State State : Optimal Foreign State: None Foreign State: None State : Optimal Foreign State: None Foreig... [15:34:14] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-query [15:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:02] (03CR) 10Jcrespo: "I added you to the "Trusted contributors" group. 
Could you try uploading a patch (or rebase one of the previous ones)- so we check your up" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [15:38:38] (03PS2) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) [15:40:46] (03CR) 10Filippo Giunchedi: [C: 03+2] Add thanos-query discovery record [dns] - 10https://gerrit.wikimedia.org/r/595965 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:41:00] (03PS1) 10BryanDavis: Remove unused static-web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/595966 [15:41:02] (03PS3) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) [15:44:17] (03CR) 10Bstorm: kubeadm: add wmcs-k8s-node-upgrade.py script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [15:44:30] RECOVERY - Host furud.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.54 ms [15:45:11] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Elitre) Thanks all. [15:45:24] James_F: Hi, is this a good moment for you to run the script? 
[15:45:45] [Not necessarily right now] [15:48:00] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:48:24] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:43] Daimona: Sure, give me a sec. [15:49:06] Nice! No worry, I have a few hours available :) [15:50:56] PROBLEM - Host furud.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:51:07] (03CR) 10Bstorm: "I have a single feature request that might be worth the time:" [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [15:52:03] (03CR) 10Bstorm: kubeadm: add wmcs-k8s-node-upgrade.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [15:52:05] (03PS4) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) [15:52:31] (03PS1) 10Ottomata: eventgate - Set NODE_EXTRA_CA_CERTS [deployment-charts] - 10https://gerrit.wikimedia.org/r/595969 (https://phabricator.wikimedia.org/T238230) [15:52:44] (03PS8) 10Hnowlan: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) [15:52:48] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install thanos-be200[1-4] - 
https://phabricator.wikimedia.org/T251634 (10Papaul) [15:53:55] (03PS1) 10Andrew Bogott: Add cloudcontrol1003 and 1004 back to openstack_controllers list [puppet] - 10https://gerrit.wikimedia.org/r/595970 (https://phabricator.wikimedia.org/T252121) [15:54:29] (03CR) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [15:54:50] (03CR) 10Privacybatm: "Thank you for your review. The directory structure makes sense to me. The transferer separation also sounds good. Let me make those change" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [15:55:44] (03CR) 10Ottomata: [C: 03+2] eventgate - Set NODE_EXTRA_CA_CERTS [deployment-charts] - 10https://gerrit.wikimedia.org/r/595969 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [15:55:47] (03CR) 10Bstorm: kubeadm: add wmcs-k8s-node-upgrade.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [15:56:14] (03CR) 10Andrew Bogott: [C: 03+2] Add cloudcontrol1003 and 1004 back to openstack_controllers list [puppet] - 10https://gerrit.wikimedia.org/r/595970 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [15:57:52] (03PS1) 10Ottomata: Update charts/index.yaml with eventgate 0.2.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595971 [15:58:42] (03CR) 10Ottomata: [C: 03+2] Fix duplicate alert descriptions for eventgate_logging_external_errors [puppet] - 10https://gerrit.wikimedia.org/r/595952 (https://phabricator.wikimedia.org/T211692) (owner: 10Ottomata) [15:59:38] (03PS9) 10Hnowlan: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) 
[15:59:43] (03CR) 10Ottomata: [C: 03+2] Update charts/index.yaml with eventgate 0.2.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595971 (owner: 10Ottomata) [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T1600). Please do the needful. [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:48] 10Operations, 10SRE-swift-storage: xfs_db blocked / timeout on ms-be2023 - https://phabricator.wikimedia.org/T185298 (10fgiunchedi) 05Open→03Invalid Not really @Marostegui, we haven't seen this again, resolving! [16:00:52] (03CR) 10Bstorm: kubeadm: add wmcs-k8s-node-upgrade.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:01:01] (03CR) 10Elukey: [C: 03+2] Add bash shabang to all bin scripts [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/595859 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [16:05:24] (03CR) 10Bstorm: kubeadm: add wmcs-k8s-node-upgrade.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:06:10] (03Abandoned) 10Jcrespo: Revert "Revert "Revert "bacula: Schedule hourly copies of production backups to the offsite pool""" [puppet] - 10https://gerrit.wikimedia.org/r/556192 (owner: 10Jcrespo) [16:06:14] (03CR) 10Bstorm: wikireplicas: remove MCR-obsoleted fields from the replica views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [16:09:00] (03CR) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:10:12] 10Operations, 10ops-eqiad, 10serviceops: 
mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10wiki_willy) Thanks @Dzahn . @Jclark-ctr - I'll move this task over to the "decommission" column on the workboard. [16:13:46] PROBLEM - Check systemd state on cescout1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:06] hmm [16:14:39] (03PS6) 10CRusnov: prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - 10https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) [16:15:34] 10Operations, 10observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (10colewhite) 05Open→03Resolved a:03colewhite Thanks for the report! There was a bug in the updated hpsa parser on initial deployment that fired these emails. It was c... [16:15:36] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10colewhite) [16:15:39] sukhe_: looking [16:15:42] (03CR) 10Bstorm: kubeadm: add wmcs-k8s-node-upgrade.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:15:55] mutante: seems to be the postgres service that is currently paused [16:16:12] sukhe_: you want it to be paused or it broke? [16:16:29] mutante: paused for now but it broke after the reboot. 
looking [16:16:48] hm, ok [16:18:10] sukhe_: it does not like permissions of the config files [16:18:29] yeah :) [16:18:50] (03PS3) 10Bstorm: wikireplicas: remove MCR-obsoleted fields from the replica views [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) [16:19:18] mutante: it is this [16:19:20] $ ls -ld /srv/metadb-data/ drwxr-xr-x 21 root root 4096 Apr 13 17:28 /srv/metadb-data/ [16:19:26] it should be postgres [16:19:52] sukhe_: aha, ack [16:20:16] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10AntiCompositeNumber) [16:21:10] (03CR) 10CRusnov: [C: 03+2] prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - 10https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [16:22:03] (03PS10) 10Ppchelko: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:22:55] (03CR) 10Ppchelko: [C: 03+2] "Maybe we will find some issue when deploying, but I can't spot any from staring at yaml, so let's try" [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:23:20] (03Merged) 10jenkins-bot: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:26:05] 10Operations, 10observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (10Marostegui) There are reports from early today from `dbprov2001` for instance. [16:26:18] (03CR) 10Bstorm: [C: 04-1] "Need to remove from build.py as well." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/595966 (owner: 10BryanDavis) [16:27:12] (03PS1) 10Ssingh: cescout: fix permissions for metadb directory [puppet] - 10https://gerrit.wikimedia.org/r/595977 [16:28:55] 10Puppet, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services), 10User-brennen: logspam-watch: Some exceptions may be missing from logspam - https://phabricator.wikimedia.org/T244528 (10brennen) 05Open→03Resolved After the last few weeks of usage of `logspam-watch` and Kiba... [16:29:40] (03CR) 10Ppchelko: "> This will need a new package to release this change" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) (owner: 10Ppchelko) [16:31:09] (03CR) 10Dzahn: [C: 03+1] cescout: fix permissions for metadb directory [puppet] - 10https://gerrit.wikimedia.org/r/595977 (owner: 10Ssingh) [16:32:55] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/22487/" [puppet] - 10https://gerrit.wikimedia.org/r/595977 (owner: 10Ssingh) [16:32:57] (03CR) 10Ssingh: [C: 03+2] cescout: fix permissions for metadb directory [puppet] - 10https://gerrit.wikimedia.org/r/595977 (owner: 10Ssingh) [16:35:23] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [16:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:53] 10Operations, 10MediaWiki-Cache, 10Traffic, 10serviceops, and 4 others: Stop sending purges for `action=history` for linked pages. 
- https://phabricator.wikimedia.org/T250261 (10Krinkle) 05Open→03Resolved a:05daniel→03Krinkle [16:35:59] 10Operations, 10Core Platform Team, 10Traffic, 10serviceops, and 2 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10Krinkle) [16:36:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:36:27] (03CR) 10Hnowlan: [C: 03+2] Preserve datetime field of the purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) (owner: 10Ppchelko) [16:37:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:29] (03PS1) 10Andrew Bogott: Glance: make cloudcontrol1003 the primary Glance host again [puppet] - 10https://gerrit.wikimedia.org/r/595979 (https://phabricator.wikimedia.org/T252121) [16:40:29] (03PS1) 10Andrew Bogott: Revert "Openstack: move API traffic to cloudcontrol1005" [dns] - 10https://gerrit.wikimedia.org/r/595980 (https://phabricator.wikimedia.org/T252121) [16:40:42] !log mstyles@deploy1001 Started deploy [wdqs/wdqs@f617307]: v0.3.31 [16:40:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:49] (03PS1) 10Hnowlan: changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595981 (https://phabricator.wikimedia.org/T220399) [16:41:03] RECOVERY - Check 
systemd state on cescout1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:35] (03CR) 10Hnowlan: [C: 03+2] changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595981 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:42:06] (03Merged) 10jenkins-bot: changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595981 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:42:14] Daimona: Did you/someone run FixOldLogEntries in prod already? It's reporting as done… [16:42:28] James_F: yes, but that's another one :) [16:42:35] This is updateVarDumps [16:42:36] Oh, duh, ignore me. :-) [16:42:42] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Pchelolo) [16:42:45] Just because killAllTheCrap was ugly [16:42:46] * James_F had a quick panic. [16:50:12] (03PS1) 10Hnowlan: changeprop-jobqueue: add stubs for secrets [labs/private] - 10https://gerrit.wikimedia.org/r/595985 (https://phabricator.wikimedia.org/T220399) [16:50:15] Huh, those progress markers are annoying [16:50:19] Daimona: OK, seems… reasonable. [16:50:41] And yeah, disabling the progress markers for quick runs. MW takes 14s. [16:50:56] I should change to print it every 10 batches or so, but we can just turn them off for now [16:50:57] aawiki takes 0.2s. ;-) [16:51:02] OK that's reasonable, yes [16:51:14] Eheh [16:52:24] Doing a dry run for all closed wikis. [16:52:31] Roger [16:52:36] Nothing surprising. [16:52:44] Shall we run there for real? 
[16:54:14] Yes please [16:54:31] (03PS3) 10Krinkle: Set "coalesceKeys" in mc.php to minimize host fan-out by WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575098 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [16:54:47] I think there's not much we can infer from the dry-run output, except how many rows are affected [16:55:05] Yeah. [16:55:35] !log mstyles@deploy1001 Finished deploy [wdqs/wdqs@f617307]: v0.3.31 (duration: 14m 53s) [16:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:52] !log Running AbuseFilter updateVarDumps on closed wikis on mwmaint1002 T246539 [16:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:54] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [16:56:57] (03CR) 10Elukey: [C: 03+1] "LGTM, maybe let's add a couple of non-analytics nodes to the pcc for ease of review from other people (quicker time to +1s I think :)" [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [16:57:08] Oh oops, I just erased the log. [16:58:11] Daimona: Anyway, done; worth poking e.g. the 226 updated entries on wikimania2018wiki ? [16:58:19] Oh [16:58:27] (03PS4) 10Krinkle: Set "coalesceKeys" in mc.php to minimize host fan-out by WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575098 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [16:58:27] Sure [16:59:23] Which is the whole table except for the last entry [16:59:30] Ha. [17:00:04] halfak and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T1700). [17:00:19] On-wiki looks good, and logstash is clear [17:00:33] Excellent. Let's do testwikis and then stop for a day? 
[17:00:47] (03PS1) 10Cwhite: admin: add Daniele Rama to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/595990 (https://phabricator.wikimedia.org/T252129) [17:01:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10colewhite) Hi Daniele! There are a few things we'll need to continue: We need an NDA on file with legal to proceed. @KFrancis, would you... [17:01:26] Yes, good idea. [17:10:35] !log Running AbuseFilter updateVarDumps on testwikis on mwmaint1002 T246539 [17:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:39] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [17:12:43] 10Operations, 10observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (10colewhite) I found the email you are referring to. Logs: ` Traceback (most recent call last): File "/usr/local/sbin/smart-data-dump", line 459, in sys.exit(... [17:13:47] (03PS1) 10CRusnov: prometheus::ops: Set netbox class config to not hostnames_only [puppet] - 10https://gerrit.wikimedia.org/r/595991 [17:14:03] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/595991 (owner: 10CRusnov) [17:15:13] 10Operations, 10Puppet: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10colewhite) Interestingly, `dbprov2001` hit the timeout fetching its facter version earlier today: ` Traceback (most recent call last): File "/usr/local/sbin/smart-data-dump", line 459, in sys.exi... [17:17:29] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. 
- https://phabricator.wikimedia.org/T252185 (10Papaul) [17:19:55] (03CR) 10CRusnov: "pcc output" [puppet] - 10https://gerrit.wikimedia.org/r/595991 (owner: 10CRusnov) [17:24:42] 10Operations, 10ops-codfw, 10DBA: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10jcrespo) I resetup the host from backups. I am going to generate a logical backup (and a snapshot will be also generated later this day) and then send this to dc ops. [17:24:45] (03CR) 10Herron: [C: 03+1] prometheus::ops: Set netbox class config to not hostnames_only [puppet] - 10https://gerrit.wikimedia.org/r/595991 (owner: 10CRusnov) [17:25:00] (03CR) 10CRusnov: [C: 03+2] prometheus::ops: Set netbox class config to not hostnames_only [puppet] - 10https://gerrit.wikimedia.org/r/595991 (owner: 10CRusnov) [17:31:55] James_F: I see it has finished, and nothing exploded. Flawless victory? [17:33:00] Daimona: Sure. :-) [17:34:00] (03CR) 10Jdlrobson: "thx!" [puppet] - 10https://gerrit.wikimedia.org/r/595621 (https://phabricator.wikimedia.org/T252222) (owner: 10Ottomata) [17:36:17] Fantastic! 
I hope I'll be able to be around tomorrow as well [17:37:17] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10Rvvalentim) [1] [1] [1] [17:43:11] !log updating maxmind database on puppetmasters (usually automated weekly; we're mid-cycle) [17:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:51] (03CR) 10Andrew Bogott: [C: 03+2] Glance: make cloudcontrol1003 the primary Glance host again [puppet] - 10https://gerrit.wikimedia.org/r/595979 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [17:47:53] (03PS1) 10Jforrester: Stop loading the ParsoidBatchAPI extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595994 (https://phabricator.wikimedia.org/T242430) [17:47:55] (03PS1) 10Jforrester: Stop loading i18n for the ParsoidBatchAPI extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595995 (https://phabricator.wikimedia.org/T242430) [17:49:18] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Openstack: move API traffic to cloudcontrol1005" [dns] - 10https://gerrit.wikimedia.org/r/595980 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [17:49:39] !log 'gdnsdctl replace' on all authdns to load new maxmind data [17:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:13] (03PS2) 10Ppchelko: Preserve datetime field of the purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) [17:50:21] (03CR) 10Ppchelko: [C: 03+2] Preserve datetime field of the purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) (owner: 10Ppchelko) [17:50:39] (03Merged) 10jenkins-bot: Preserve datetime field of the purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) (owner: 10Ppchelko) 
[17:51:16] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: add stubs for secrets [labs/private] - 10https://gerrit.wikimedia.org/r/595985 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:02:09] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10Papaul) [18:05:51] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Daniram3) >>! In T252129#6128843, @Daniram3 wrote: > [6] Here is my SSH public key for production access: > > ssh-ed25519 AAAAC3NzaC1lZDI1... [18:10:38] backkk [18:10:56] (03PS3) 10Jforrester: restrouter: Remove k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/573257 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [18:14:28] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [18:14:28] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [18:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:12] 10Operations, 10Traffic: Maxmind data update issues for DNS (and others?) - https://phabricator.wikimedia.org/T252577 (10BBlack) p:05Triage→03Medium [18:17:05] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[18:17:05] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [18:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:13] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [18:23:22] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [18:23:23] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [18:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:39] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) >>! In T252210#6128718, @MoritzMuehlenhoff wrote: > Why does this need a complete VM, though? If this simply sends som... [18:28:36] (03CR) 10CRusnov: "> Patch Set 1:" (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/595717 (owner: 10CRusnov) [18:33:34] 10Operations, 10Traffic: Maxmind data update issues for DNS (and others?) - https://phabricator.wikimedia.org/T252577 (10BBlack) Diving a little deeper on the symlink issue: 1. gdnsd uses libev's `ev_stat` watcher for this and other similar cases, as documented here: http://pod.tst.eu/http://cvs.schmorp.de/li... 
[18:33:55] I will process with the train to group0 in ~ 30 minutes [18:40:10] (03PS1) 10Bstorm: toolforge-kubeadm: calico upgrade changes [puppet] - 10https://gerrit.wikimedia.org/r/596012 (https://phabricator.wikimedia.org/T250863) [18:41:49] !log started codereview-archiver script in screen on mwmaint1002 [18:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:51] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [18:42:23] (03CR) 10Bstorm: "This worked beautifully in toolsbeta via livehack. It is currently running without typha at 3.14.0." [puppet] - 10https://gerrit.wikimedia.org/r/596012 (https://phabricator.wikimedia.org/T250863) (owner: 10Bstorm) [18:43:03] (03CR) 10Bstorm: "Restoring toolsbeta now, but you'll still be able to see the affect on the cluster since I'm not rolling back the upgrade itself, just the" [puppet] - 10https://gerrit.wikimedia.org/r/596012 (https://phabricator.wikimedia.org/T250863) (owner: 10Bstorm) [18:44:41] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Legoktm) >>! In T243056#6111775, @Dzahn wrote: > @Legoktm I added the new name to DNS and started with the puppet patch to create the htt... [18:45:04] 10Operations, 10Traffic: Maxmind data update issues for DNS (and others?) - https://phabricator.wikimedia.org/T252577 (10faidon) > I know that historically MaxMind has claimed they update the data roughly on a weekly basis, and maybe in this case it was a normal weekly update and we're just misaligned with the... 
[18:57:05] (03PS1) 10Legoktm: static-codereview: Add link to SQL dumps [puppet] - 10https://gerrit.wikimedia.org/r/596018 (https://phabricator.wikimedia.org/T243055) [18:57:06] (03PS1) 10Legoktm: static-codereview: Update notes URL to new wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/596019 (https://phabricator.wikimedia.org/T243056) [18:58:33] (03CR) 10Legoktm: "I didn't uncomment the monitoring stuff, I was going to leave that to you in case anything else needed changing." [puppet] - 10https://gerrit.wikimedia.org/r/596019 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [19:00:04] hashar and twentyafterfour: That opportune time is upon us again. Time for a Mediawiki train - European+American Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T1900). [19:01:07] hashar: is there anything I can help with the train? Time for group0? [19:01:23] twentyafterfour: hiiii [19:01:48] well not much, I had to exit the office as soon as I have finished the roll out to testwiki [19:02:05] and I guess I will do the group0 now. But surely I can use help with the log spam triaging! [19:02:47] let me know if/when I caused more UBNs with my Revision wor [19:02:49] k [19:03:27] (03PS1) 10Hashar: group0 wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596020 [19:03:29] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596020 (owner: 10Hashar) [19:04:00] hi DannyS712 ! Thank you for the nice summary you have put on the blocker task. You mentioned a patch worth having (which has a return to cast to integer explicitly). 
It is not in this branch though [19:04:10] but we can cherry pick if need be [19:04:24] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596020 (owner: 10Hashar) [19:05:33] Not sure cherry picking is needed for now, but it was merged to master so its possible to do [19:05:40] ahh cool [19:05:53] so yeah we can cherry pick the master patch to 1.35.0-wmf.32 and deploy it [19:05:57] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.32 [19:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:12] * hashar heads to log triaging [19:07:06] all calm so far [19:08:00] DannyS712: and kudos for adding all those tests :] [19:08:49] I cherry picked it at v [19:08:51] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/596022/ [19:09:30] twentyafterfour: no new log at all for .32 :] I am the most lucky one at deploying the train!! [19:13:12] DannyS712: I guess I will deploy your patch in an hour or so [19:20:23] it is all quiet ;) [19:27:45] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [19:28:30] 10Operations, 10MediaWiki-extensions-PdfHandler, 10MW-1.35-notes (1.35.0-wmf.24; 2020-03-17): Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007 (10Aklapper) 05Open→03Resolved This has been fixed either via the... 
[19:35:03] (03PS1) 10Ottomata: wgEventStreams and wgEventLoggingStreamNames Use +deploymentwiki for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596034 (https://phabricator.wikimedia.org/T238230) [19:38:43] (03CR) 10Ottomata: [C: 03+2] wgEventStreams and wgEventLoggingStreamNames Use +deploymentwiki for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596034 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:39:32] (03Merged) 10jenkins-bot: wgEventStreams and wgEventLoggingStreamNames Use +deploymentwiki for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596034 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:46:53] no log errors for wmf.32 so far [19:47:33] Yay :) [19:55:09] !log dpifke@deploy1001 Started deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086 [19:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:12] T238086: Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 [19:55:14] !log dpifke@deploy1001 Finished deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086 (duration: 00m 05s) [19:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:36] DannyS712: I am going to hot deploy your patch [19:56:52] hot deploy? [19:57:10] oh [19:57:17] well I mean deploy it outside of the train [19:57:24] Oh. Okay [19:58:29] it is in the ci pipe which is going to take ~ 15 minutes [20:07:45] Krinkle: this look ok to you? [20:07:51] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/596049/1/includes/EventLoggingHooks.php [20:08:50] ottomata: not sure. I think you want to move the two around and keep + [20:09:04] yeah you think that is better? [20:09:06] array_merge will produce a sequential array sometimes and strip the keys [20:09:18] ok PHP. 
[20:09:22] :) [20:09:23] k will do [20:09:38] Good spot though, I hadn't realised it was the other way around from what you wanted to re-use the feature for :) [20:12:04] ok Krinkle tested and patched [20:13:10] I am going to deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/596022/ for DannyS712 [20:15:42] !log hashar@deploy1001 Synchronized php-1.35.0-wmf.32/includes/revisionlist/RevisionItemBase.php: Fix RevisionItemBase::getId to actually return an int, as intended - T252076 (duration: 01m 06s) [20:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:46] T252076: Cover RevisionList/RevisionItem classes with tests - https://phabricator.wikimedia.org/T252076 [20:16:16] Was only the one file synced? [20:16:21] yeah [20:16:25] okay [20:16:30] the others are test files which we don't need on prod [20:16:35] yeah [20:17:23] no spam so far [20:19:08] DannyS712: it is probably all fine. I guess you can return to your other occupations :] [20:19:20] thank you so much for being around, that is really helpful [20:19:39] my primary occupation is trying to convince people to review my Revision patches [20:19:49] Any volunteers? [20:22:08] I guess people from the core platform team are probably ideal candidates, they are in #wikimedia-cpt [20:22:22] depends on how busy they are though [20:23:01] I would have loved to review those refactoring things, but I haven't touched mediawiki in ages and it probably has been refactored twice since I last touched it [20:24:47] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10dpifke) This is deployed, and I updated the Grafana dashboard. To nuke the data, we would need to restart Prometheus with `--web.enable-admin-api` flag... 
[20:30:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:36:01] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:51:52] 10Operations, 10Anti-Harassment, 10Traffic, 10serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (10aezell) I wanted to clarify that this is just in the experiment and investigation stage. We want to start a discussion about using MaxMind to g... [21:00:33] I am off [21:17:25] (03PS3) 10Privacybatm: transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) [21:17:49] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [21:19:01] (03CR) 10Andrew Bogott: "I've already applied the change from this patch by hand -- if there are still things to do lmk and we can sync up." 
[puppet] - 10https://gerrit.wikimedia.org/r/595207 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [21:24:12] (03CR) 10Privacybatm: "> Patch Set 2:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [21:27:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10KFrancis) >>! In T252129#6130190, @colewhite wrote: > Hi Daniele! > > There are a few things we'll need to continue: > > We need an NDA o... [21:35:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Miriam) @KFrancis thanks! We discussed this case over email, and my understanding was that the signed letter of agreement already contains... [21:59:56] (03PS1) 10Bstorm: cloud-node-exporter: ignore NFS on the cloud client side [puppet] - 10https://gerrit.wikimedia.org/r/596063 (https://phabricator.wikimedia.org/T252260) [22:09:35] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis) [22:11:40] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis) p:05Triage→03Medium [22:20:29] (03CR) 10Mstyles: "not sure if this should be a part of this patch or a separate one. But we should probably rename the journal file (get from hieradata) fro" [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [22:20:44] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10Krinkle) >>! 
@CDanis wrote: > (The reason for the difference between the per-process reported state and the Status: "Processes active: 0, idle 8 state that php-fpm alread... [22:32:16] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis) >>! In T252605#6131597, @Krinkle wrote: >>>! @CDanis wrote: >> (The reason for the difference between the per-process reported state and the Status: "Processes ac... [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200512T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:50] I have trust issues with you, jouncebot. [23:03:08] (03PS1) 10Alex Monk: puppet-facts-export-puppetdb: Read localcacert from right section [puppet] - 10https://gerrit.wikimedia.org/r/596069 (https://phabricator.wikimedia.org/T252606) [23:03:13] (03CR) 10Bstorm: "Seems to do the thing: https://puppet-compiler.wmflabs.org/compiler1002/22495/clouddb1001.clouddb-services.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596063 (https://phabricator.wikimedia.org/T252260) (owner: 10Bstorm) [23:03:42] (03CR) 10jerkins-bot: [V: 04-1] puppet-facts-export-puppetdb: Read localcacert from right section [puppet] - 10https://gerrit.wikimedia.org/r/596069 (https://phabricator.wikimedia.org/T252606) (owner: 10Alex Monk) [23:05:24] (03PS2) 10Alex Monk: puppet-facts-export-puppetdb: Read localcacert from right section [puppet] - 10https://gerrit.wikimedia.org/r/596069 (https://phabricator.wikimedia.org/T252606) [23:06:04] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [23:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:24] (03CR) 10Bstorm: "I can confirm that on production, the two certs returned by the command with and without 
the section argument are the same. We just have " [puppet] - 10https://gerrit.wikimedia.org/r/596069 (https://phabricator.wikimedia.org/T252606) (owner: 10Alex Monk) [23:09:30] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [23:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:57] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary outbound port utilisation over 80% #page (asw2-esams.mgmt.esams.wmnet) https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts [23:20:45] 👋 [23:21:50] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts [23:21:53] 👋 [23:23:08] according to https://librenms.wikimedia.org/alert-log the relevant port was ae2 on asw2-esams [23:23:19] but https://librenms.wikimedia.org/graphs/to=1589325600/id=19265/type=port_bits/from=1589304000/ [23:23:59] * volans almost in bed but here if needed [23:24:07] volans: I think it was just a monitoring glitch of some sort [23:24:29] ae2@asw2-esams is an 80Gbps virtual interface that's an aggregate of several underlying physical interfaces [23:24:43] and it peaked for the day (hours ago) at 19Gbps [23:24:58] and has some weird missing-data artifact for the time window of the alert [23:25:29] ack [23:28:22] anyway, dunno what happened, but seems like just a glitch, I'm going afk again [23:30:10] same [23:30:12] thanks cdanis [23:34:05] (03PS1) 10Papaul: DNS: Add DNS for db213[6-9] and db2140 [dns] - 10https://gerrit.wikimedia.org/r/596071 [23:56:33] (03PS1) 10Ryan Kemper: sre.wdqs.data-transfer: use proper systemctl path [cookbooks] - 10https://gerrit.wikimedia.org/r/596073 (https://phabricator.wikimedia.org/T206951) [23:56:45] (03PS2) 10Papaul: DNS: Add DNS for db213[6-9] and db2140 [dns] - 10https://gerrit.wikimedia.org/r/596071 [23:57:54] (03CR) 10EBernhardson: [C: 
03+1] "Matches what i see on wdqs2002, going to assume that holds for debian in general" [cookbooks] - 10https://gerrit.wikimedia.org/r/596073 (https://phabricator.wikimedia.org/T206951) (owner: 10Ryan Kemper)