[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T0000).
[00:00:11] (PS1) Thcipriani: helmfile: Update README to mention ".hfenv" [deployment-charts] - https://gerrit.wikimedia.org/r/525468
[00:18:59] (PS1) CDanis: conftool-data: add entries for mwconfig/dbconfig [puppet] - https://gerrit.wikimedia.org/r/525469
[00:25:41] Operations, MobileFrontend, Traffic, Patch-For-Review, Readers-Web-Backlog (Tracking): Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (Jdlrobson)
[00:31:02] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul)
[00:53:55] (PS1) Ayounsi: Anycast, make bird::bind_service more generic [puppet] - https://gerrit.wikimedia.org/r/525470
[01:04:00] (PS7) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037
[01:06:36] (CR) Ayounsi: [C: +2] Reserve IP for syslog anycast [dns] - https://gerrit.wikimedia.org/r/524045 (owner: Ayounsi)
[01:06:41] (PS2) Ayounsi: Reserve IP for syslog anycast [dns] - https://gerrit.wikimedia.org/r/524045
[01:08:01] (PS2) Ayounsi: Anycast move bird::neighbors_list from role/site for all sites [puppet] - https://gerrit.wikimedia.org/r/525303
[01:08:02] (PS8) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037
[01:08:33] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (RobH) > Dear Rob Halsell, > > Your dispatch shipped on 7/24/2019 7:50 PM > > What's Next? > > If you need to make a...
[01:11:45] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (RobH) EQ inbound shipment ticket - 1-191287024247
[02:02:59] !log remove peer AS63541 from cr1-eqsin
[02:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:48] Operations, netops, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (ayounsi) Open→Resolved Fixed with the latest upgrade of Routinator
[02:05:52] Operations, netops, Patch-For-Review: RPKI Validation - https://phabricator.wikimedia.org/T220669 (ayounsi)
[02:06:20] Operations, netops: AS63541's session down reported by cr1-eqsin - https://phabricator.wikimedia.org/T228617 (ayounsi) Open→Resolved a:ayounsi Peer removed, cf. emails to peering@
[02:14:33] (PS1) Ayounsi: Fastnetmon: bump threshold_mbps to 700 [puppet] - https://gerrit.wikimedia.org/r/525471 (https://phabricator.wikimedia.org/T226810)
[02:15:11] (CR) Ayounsi: [C: +2] Fastnetmon: bump threshold_mbps to 700 [puppet] - https://gerrit.wikimedia.org/r/525471 (https://phabricator.wikimedia.org/T226810) (owner: Ayounsi)
[02:21:05] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:21:13] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:39:27] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:40:07] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:41:17] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:43:59] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 30906952 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:49:01] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 393936 and 74 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:04:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:08:27] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:09:31] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:37:51] PROBLEM - Host pc2010 is DOWN: PING CRITICAL - Packet loss = 100%
[04:30:46] pc2010 again? :(
[04:36:51] (PS4) Marostegui: mariadb: Promote db1128 as master for m3 [puppet] - https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243)
[04:38:55] Operations, ops-eqiad, DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (Marostegui)
[04:38:57] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Marostegui)
[04:39:43] RECOVERY - Host pc2010 is UP: PING OK - Packet loss = 0%, RTA = 36.52 ms
[04:42:25] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Marostegui) @robh if you guys don't have any preference on which rack to start with...from the DB side, `B3` can be a good option if it can be done before Tu...
[04:42:36] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) This host crashed again, this time it was totally frozen and I had to reset it via idrac. These are the HW logs, same issue: ` ----------------------------------------------------------...
[04:43:13] Operations, ops-eqiad, DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (Marostegui)
[04:43:43] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Marostegui)
[04:43:46] Operations, ops-eqiad, DC-Ops: dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228891 (Marostegui)
[04:44:43] Operations, ops-eqiad, DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (Marostegui) >>! In T228732#5362544, @Cmjohnson wrote: > @Marostegui This can be done any day...Let's plan 8/6 @1000EDT /1400UTC Great! Thank you. I have made a note on my calendar so the host wi...
[04:49:03] PROBLEM - puppet last run on pc2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:51:54] !log Start pre-failover steps on m3 T228243
[04:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:03] T228243: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243
[04:55:09] (CR) Marostegui: mariadb: Promote db1128 as master for m3 [puppet] - https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) (owner: Marostegui)
[04:55:14] (CR) Marostegui: [C: +2] mariadb: Promote db1128 as master for m3 [puppet] - https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) (owner: Marostegui)
[04:56:09] ACKNOWLEDGEMENT - MariaDB Slave Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2692.11 seconds Marostegui hw issues https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:56:09] ACKNOWLEDGEMENT - puppet last run on pc2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] Marostegui hw issues https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:03:59] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui)
[05:04:23] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui) db2121-db2125 looking good! Thanks
[05:06:01] RECOVERY - puppet last run on pc2010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:06:10] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui)
[05:06:44] twentyafterfour: morning! I am all set from my side, just waiting for the .30 hour, all the pre-steps are done
[05:19:57] marostegui: cool, I'm ready to set read-only when you tell me you're ready
[05:20:25] twentyafterfour: awesome!. In 10 minutes it will be
[05:20:31] The whole db thing will take just a few seconds
[05:20:40] So once you set read only, get ready to remove it too :)
[05:30:05] marostegui, jynus, and twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do m3 phabricator database master failover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T0530).
[05:30:15] twentyafterfour: I am ready when you are!
[05:30:24] marostegui: ok going read-only now...
[05:30:29] ok, let me know when done
[05:30:44] !log phabricator set to read-only mode
[05:30:49] marostegui: done
[05:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:54] !log Failover m3 from db1072 to db1128 - T228243
[05:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:01] T228243: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243
[05:31:22] twentyafterfour: I am done
[05:31:37] !log set phabricator to read-write mode
[05:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:54] everything appears to be working
[05:32:06] Operations: test - https://phabricator.wikimedia.org/T228955 (Marostegui)
[05:32:07] I am testing too :)
[05:32:22] Operations: test - https://phabricator.wikimedia.org/T228955 (Marostegui) a:mmodell
[05:32:34] Operations: test - https://phabricator.wikimedia.org/T228955 (mmodell) looks good
[05:32:49] Operations: test - https://phabricator.wikimedia.org/T228955 (Marostegui) Open→Resolved
[05:33:05] Operations: test - https://phabricator.wikimedia.org/T228955 (Marostegui)
[05:33:06] twentyafterfour: I think we are good then :)
[05:33:16] Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui)
[05:33:31] marostegui: thanks, that was fast, nice work.
[05:33:44] twentyafterfour: Thank you for your help - much appreciated! Have a good sleep :)
[05:33:56] You're welcome, no problem at all
[05:34:27] looks all fine indeed, I could edit a task just fine
[05:34:42] Great - thanks moritzm
[05:34:49] I am going to do the clean up steps now then
[05:41:39] Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui) Open→Resolved This has been done. Phabricator read only start: 05:30:44 Phabricator read only stop: 05:31:37 Total read-only time: 53 se...
[05:41:42] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:43:59] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[05:45:14] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227538 (Marostegui) As of today, db1072 is no longer a master (T228243#5363931), so this rack is also good to go. db1072 will be decommissioned in a few days
[05:45:33] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227538 (Marostegui)
[05:47:09] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[05:48:25] Operations, DBA, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui)
[05:48:32] Operations, DBA, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui) p:Triage→Normal
[05:51:35] (PS1) Marostegui: db1072: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/525480 (https://phabricator.wikimedia.org/T228956)
[05:52:29] (CR) Marostegui: [C: +2] db1072: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/525480 (https://phabricator.wikimedia.org/T228956) (owner: Marostegui)
[06:11:45] (CR) Smalyshev: [C: +1] "Doesn't seem to be a lot happening with this for the last 2 weeks, are we going to merge this?" [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[06:14:39] (PS1) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909)
[06:16:20] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (Eldarado) I want to talk about azwiki community. I can't understand that why you fluff azwiki problem this much. I know that we had two admins who are against the wiki rules. But we alrea...
[06:17:56] (PS2) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909)
[06:18:19] Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (MoritzMuehlenhoff) Resolved→Open @herron : You've added her to the wrong group, staff members need to be a member of cn=wm...
[06:18:51] (CR) ArielGlenn: "> Doesn't seem to be a lot happening with this for the last 2 weeks," [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[06:22:09] (PS3) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909)
[06:32:38] PROBLEM - puppet last run on elastic2037 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:33:44] Operations, Analytics, Analytics-Kanban, netops: Allow analytics VLAN to reach eventgate-analytics.discovery.wmnet:31192 - https://phabricator.wikimedia.org/T228882 (elukey) Open→Resolved a:elukey ` + term eventgate { + from { + destination-address { +...
[06:33:54] Operations, Analytics, Analytics-Kanban, Discovery, and 2 others: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (elukey)
[06:34:53] !log add term eventgate to analytics-in4 on cr1/cr2-eqiad - T228882
[06:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:02] T228882: Allow analytics VLAN to reach eventgate-analytics.discovery.wmnet:31192 - https://phabricator.wikimedia.org/T228882
[06:37:14] !log restart cassandra instances on aqs1004 to pick up new openjdk-8 version
[06:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:00] !log restart kafka* on kafka-jumbo1001 to pick up new openjdk-8 version
[06:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:44] (Abandoned) Muehlenhoff: kibana: Read LDAP servers from standard Hiera and switch to read-only replicas [puppet] - https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) (owner: Muehlenhoff)
[06:56:11] (PS5) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870)
[06:58:05] Operations, ops-eqiad, vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (akosiaris) @RobH, @Cmjohnson Indeed the refreshes are for ganeti100[1-4] so row C it is. Try to spread them across 1G racks. However, the 6 ganeti nodes of the expa...
[06:58:58] (CR) Muehlenhoff: [C: +2] Configure unconditional flushes of the L1 cache during VMENTER [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: Muehlenhoff)
[07:00:44] RECOVERY - puppet last run on elastic2037 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:01:35] Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (akosiaris) @RobH, @Cmjohnson Despite the designation as CI, we will be treating these uniformly as far as ganeti goes (we will be handling the capacity allocations within ganeti) so: * Single rack ro...
[07:02:36] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:02:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:16] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 70.60, 40.60, 23.78 https://wikitech.wikimedia.org/wiki/Application_servers
[07:06:22] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 56.44, 29.45, 17.62 https://wikitech.wikimedia.org/wiki/Application_servers
[07:06:28] (CR) Jbond: [C: +1] "seems sensible" [puppet] - https://gerrit.wikimedia.org/r/525204 (https://phabricator.wikimedia.org/T220669) (owner: Ayounsi)
[07:06:36] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 52.98, 27.81, 17.79 https://wikitech.wikimedia.org/wiki/Application_servers
[07:06:38] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 75.01, 37.82, 21.77 https://wikitech.wikimedia.org/wiki/Application_servers
[07:06:48] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 63.82, 31.37, 18.26 https://wikitech.wikimedia.org/wiki/Application_servers
[07:06:50] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 66.50, 32.89, 19.05 https://wikitech.wikimedia.org/wiki/Application_servers
[07:07:06] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 54.35, 30.35, 18.15 https://wikitech.wikimedia.org/wiki/Application_servers
[07:07:28] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 76.81, 42.64, 24.74 https://wikitech.wikimedia.org/wiki/Application_servers
[07:08:32] PROBLEM - High CPU load on API appserver on mw1224 is CRITICAL: CRITICAL - load average: 51.82, 28.38, 16.71 https://wikitech.wikimedia.org/wiki/Application_servers
[07:08:34] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 86.47, 45.17, 24.64 https://wikitech.wikimedia.org/wiki/Application_servers
[07:09:28] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 82.99, 43.13, 23.20 https://wikitech.wikimedia.org/wiki/Application_servers
[07:09:42] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 67.26, 32.71, 18.80 https://wikitech.wikimedia.org/wiki/Application_servers
[07:10:14] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 67.84, 36.77, 20.81 https://wikitech.wikimedia.org/wiki/Application_servers
[07:10:46] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:10:58] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:11:04] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:11:58] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:14:04] (PS2) Ema: vcl: update Vary:XFP fixup comment [puppet] - https://gerrit.wikimedia.org/r/525308 (https://phabricator.wikimedia.org/T51700)
[07:14:19] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2126.codfw.wmnet'] ` The log can be found in `/var/log/wmf-aut...
[07:15:14] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:16:27] (CR) Ema: [C: +2] vcl: update Vary:XFP fixup comment [puppet] - https://gerrit.wikimedia.org/r/525308 (https://phabricator.wikimedia.org/T51700) (owner: Ema)
[07:16:44] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:16:58] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:17:28] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:18:18] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:18:42] (PS1) Marostegui: install_server: Allow reimage dbproxy1020,dbproxy1021 [puppet] - https://gerrit.wikimedia.org/r/525482 (https://phabricator.wikimedia.org/T228618)
[07:19:02] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:19:02] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:19:10] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:19:12] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:19:22] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:19:44] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:10] (PS2) Marostegui: install_server: Allow reimage dbproxy1020,dbproxy1021 [puppet] - https://gerrit.wikimedia.org/r/525482 (https://phabricator.wikimedia.org/T228618)
[07:20:57] (CR) Marostegui: [C: +2] install_server: Allow reimage dbproxy1020,dbproxy1021 [puppet] - https://gerrit.wikimedia.org/r/525482 (https://phabricator.wikimedia.org/T228618) (owner: Marostegui)
[07:21:50] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:22:24] (PS1) Elukey: profile::tlsproxy::instance: avoid dynamic tls records in Buster [puppet] - https://gerrit.wikimedia.org/r/525483 (https://phabricator.wikimedia.org/T228730)
[07:22:27] Operations, ops-eqiad, DC-Ops, Patch-For-Review: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (Marostegui) @Cmjohnson As we spoke yesterday, I have added an install recipe for these hosts, so they can be re-imaged. Once you've committed th...
[07:22:49] (PS3) Marostegui: mariadb: Promote db1104 to s8 master [puppet] - https://gerrit.wikimedia.org/r/524411 (https://phabricator.wikimedia.org/T227062)
[07:23:17] !log Upgrade MySQL on db1072
[07:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:04] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:26:00] is anyone looking at those spikes of load on api servers?
[07:27:15] latency has increased https://grafana.wikimedia.org/d/000000002/api-backend-summary?refresh=5m&orgId=1&from=1563953183077&to=1564039583077 and parsoid seems to be the offender regarding load
[07:27:22] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:27:37] I was not, for some reason irssi didn't highlight them for me sigh
[07:27:53] https://grafana.wikimedia.org/d/000000327/apache-hhvm?orgId=1&from=now-6h&to=now
[07:28:10] hhvm load and queued rose a lot
[07:28:28] I think that we should keep one for investigation and roll restart hhvm on api appservers
[07:28:35] _joe_ what do you think?
[07:28:36] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:29:14] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:29:18] or might be more traffic, let's also triple check
[07:29:24] <_joe_> elukey: looks like a lot of traffic
[07:29:25] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2126.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2126.codfw.wmnet'] `
[07:29:28] ah okok
[07:29:45] <_joe_> uhm no actually
[07:30:12] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:30:12] the rps seems stable, but httpd's workers are busy
[07:30:21] looks like hhvm is slowing down
[07:30:24] <_joe_> yes
[07:30:27] <_joe_> only on api
[07:30:33] <_joe_> lemme do a triple check
[07:30:56] <_joe_> https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=2200&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All
[07:30:59] I started seeing more apis pegged to high load for ~30+ minutes yesterday
[07:31:22] <_joe_> mw1348 is only running php7 and has an increase in cpu usage
[07:31:40] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/525470 (owner: Ayounsi)
[07:31:45] <_joe_> elukey: try restarting a couple of the more ill-behaved
[07:32:05] <_joe_> looks like someone is doing a ton of expensive requests
[07:32:27] (PS2) Elukey: profile::tlsproxy::instance: avoid dynamic tls records in Buster [puppet] - https://gerrit.wikimedia.org/r/525483 (https://phabricator.wikimedia.org/T228730)
[07:32:28] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:32:34] <_joe_> but the load is going down
[07:32:38] yep
[07:32:47] I'd wait a bit more, I think it is recovering
[07:32:59] <_joe_> yes
[07:33:10] <_joe_> it would be interesting to know what api pattern is causing this
[07:33:17] !log rebooting cloudvirt2001-dev
[07:33:18] <_joe_> or which user, if we only had api tokens
[07:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:23] fsero: didn't mean to overstep, if you want to investigate please go ahead
[07:34:02] glad that you overstepped, I was worried nobody with some experience on it was looking into it
[07:34:07] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/525303 (owner: Ayounsi)
[07:36:25] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17603/" [puppet] - https://gerrit.wikimedia.org/r/525483 (https://phabricator.wikimedia.org/T228730) (owner: Elukey)
[07:38:03] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui) @Papaul I tried to install db2126 myself to advance on the task, but looks like it keeps rebooting on PXE forever :-) I think it needs your on-site magic, as with...
[07:40:41] thcipriani: Who can merge, https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/525456/ ?
[07:46:06] (PS1) Muehlenhoff: Fix name for kmod::options, it needs to match the kernel module name [puppet] - https://gerrit.wikimedia.org/r/525484
[07:46:45] I have CR+2 over there kart_ but should I?
[07:47:33] hauskatze: If you know what it does :) I don't have any idea, but it seems blocking patches further due to CI failure. [07:48:03] Let me reword: Who can review it :) [07:48:10] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2126.codfw.wmnet'] ` The log can be found in `/var/log/wmf-aut... [07:50:47] ah, well, I don't really have much knowledge about blubber; perhaps hashar when he's back? [07:51:24] (03CR) 10Volans: [C: 03+1] "I'm wondering if we need the same for dbconfig-section and dbconfig-instance. LGTM otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/525469 (owner: 10CDanis) [07:51:32] RECOVERY - High CPU load on API appserver on mw1224 is OK: OK - load average: 10.45, 12.01, 23.86 https://wikitech.wikimedia.org/wiki/Application_servers [07:52:45] (03CR) 10Muehlenhoff: [C: 03+2] Fix name for kmod::options, it needs to match the kernel module name [puppet] - 10https://gerrit.wikimedia.org/r/525484 (owner: 10Muehlenhoff) [07:54:24] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 9.26, 10.90, 22.84 https://wikitech.wikimedia.org/wiki/Application_servers [07:54:54] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 9.18, 11.32, 23.41 https://wikitech.wikimedia.org/wiki/Application_servers [07:55:50] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 11.94, 12.40, 23.76 https://wikitech.wikimedia.org/wiki/Application_servers [07:56:32] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 10.71, 10.97, 23.64 https://wikitech.wikimedia.org/wiki/Application_servers [07:59:34] <_joe_> !log repooling mw1276-1283 in the API cluster [07:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:47] !log 
oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=eqiad,name=mw12(7[6-9|8[0-3]).* [07:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:54] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 9.17, 10.93, 23.91 https://wikitech.wikimedia.org/wiki/Application_servers [07:59:57] 10Operations, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (10akosiaris) 05Open→03Resolved +1. Thanks @Joe , thanks @DZahn. I 've added both of... [08:00:56] !log rebooting cloudvirt2001-dev [08:01:00] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 7.73, 9.90, 23.18 https://wikitech.wikimedia.org/wiki/Application_servers [08:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:03] <_joe_> !log repooling mw1268-1275 in the appserver cluster [08:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:10] <_joe_> elukey: this might help... [08:01:43] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=appserver,dc=eqiad,name=mw12(6[89]|7[0-5]).* [08:01:44] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 13.07, 11.71, 23.17 https://wikitech.wikimedia.org/wiki/Application_servers [08:01:45] ack [08:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:05] <_joe_> we were running short of 8 api servers [08:02:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, but note that this will have to be manually rebased (essentially index.yaml will have to be regenerated)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [08:02:53] hauskatze: OK. Thanks. 
I also want to know more about it. [08:02:54] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.05, 11.12, 23.79 https://wikitech.wikimedia.org/wiki/Application_servers [08:03:08] _joe_ sigh I didn't get it at first.. [08:03:08] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 10.66, 11.59, 23.19 https://wikitech.wikimedia.org/wiki/Application_servers [08:04:58] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=eqiad,name=mw128[0-3].* [08:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:54] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 7.54, 10.91, 22.82 https://wikitech.wikimedia.org/wiki/Application_servers [08:07:55] (03Abandoned) 10Ema: varnish: retry requests upon 502 errors [puppet] - 10https://gerrit.wikimedia.org/r/507953 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [08:09:12] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 7.21, 10.63, 22.66 https://wikitech.wikimedia.org/wiki/Application_servers [08:10:24] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 8.63, 11.20, 23.11 https://wikitech.wikimedia.org/wiki/Application_servers [08:11:53] (03CR) 10Jbond: [C: 04-1] "I think in general this seems fine for facts which don't change or ones we don't care about. however we should be careful regarding facts" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805) (owner: 10Andrew Bogott) [08:12:22] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2126.codfw.wmnet'] ` and were **ALL** successful. 
[08:13:30] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) >>! In T227113#5364053, @Marostegui wrote: > @Papaul I tried to install db2126 myself to advance on the task, but looks like it keeps rebooting on PXE forever :-)... [08:14:34] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) [08:14:43] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi) [08:17:50] (03CR) 10Jbond: [C: 03+2] puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [08:17:59] (03PS17) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [08:18:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, although in theory with new ferm versions this isn't needed anymore, I'll let Moritz comment on it" [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [08:19:21] (03CR) 10Filippo Giunchedi: [C: 03+1] Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [08:21:10] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:21:14] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:25:34] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10elukey) Adding @dcausse to the conversation since @Gehel is on holiday. I would simply buy the disk now, but not sure if elastic1046 is scheduled to be refreshed soon. [08:27:23] elukey: thanks ! Only on my phone atm. This one is not scheduled for replacement. [08:27:58] elukey: lvs has probably already marked it as down, but could you depool it just in case ? [08:30:14] (03PS1) 10Ema: tlsproxy: toggle dynamic ssl_buffer_size settings [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) [08:30:28] gehel: hello! Sorry I didn't mean to cause a ping :( I depooled it a while ago when the ticket was cut and alerted David, all good :) [08:30:33] I'll proceed with ordering the disk then [08:31:26] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10elukey) The host is not scheduled for replacement, @wiki_willy please proceed with the order of the disk :) [08:32:07] !log Password reset for SUL user Strejc [08:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:01] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/17604/ looks happy" [puppet] - 10https://gerrit.wikimedia.org/r/524037 (owner: 10Ayounsi) [08:34:20] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [08:34:26] (03CR) 10Ema: "pcc looks good on text/upload https://puppet-compiler.wmflabs.org/compiler1001/17605/" [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) 
(owner: 10Ema) [08:35:23] (03PS1) 10Jbond: puppetmaster: point neodymium and sarin to puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/525492 [08:35:31] (03PS1) 10Muehlenhoff: Fix kernel option [puppet] - 10https://gerrit.wikimedia.org/r/525493 [08:38:02] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:38:27] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2128.codfw.wmnet'] ` The log can be found in `/var/log/wmf-aut... [08:39:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/525492 (owner: 10Jbond) [08:39:50] (03Abandoned) 10Elukey: profile::tlsproxy::instance: avoid dynamic tls records in Buster [puppet] - 10https://gerrit.wikimedia.org/r/525483 (https://phabricator.wikimedia.org/T228730) (owner: 10Elukey) [08:40:33] (03PS3) 10Arturo Borrero Gonzalez: openstack: fullstack: use a readable-friendly name for VMs [puppet] - 10https://gerrit.wikimedia.org/r/525320 [08:41:04] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:41:44] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/check-and-restart-php] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:44] (03PS2) 10Muehlenhoff: Fix kernel option [puppet] - 10https://gerrit.wikimedia.org/r/525493 [08:43:28] (03PS2) 10Ema: vcl: do not cache the beta variant of the mobile site [puppet] - 10https://gerrit.wikimedia.org/r/525268 (https://phabricator.wikimedia.org/T228861) [08:44:52] (03CR) 10Elukey: "LGTM, but there are a lot of hosts that will be affected: https://puppet-compiler.wmflabs.org/compiler1001/17606/" [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) (owner: 10Ema) [08:46:16] (03CR) 10Jbond: [C: 03+2] puppetmaster: point neodymium and sarin to puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/525492 (owner: 10Jbond) [08:46:24] (03CR) 10Muehlenhoff: [C: 03+2] Fix kernel option [puppet] - 10https://gerrit.wikimedia.org/r/525493 (owner: 10Muehlenhoff) [08:46:32] (03PS3) 10Muehlenhoff: Fix kernel option [puppet] - 10https://gerrit.wikimedia.org/r/525493 [08:47:40] (03PS4) 10Arturo Borrero Gonzalez: openstack: fullstack: use a readable-friendly name for VMs [puppet] - 10https://gerrit.wikimedia.org/r/525320 [08:48:03] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2129.codfw.wmnet'] ` The log can be found in `/var/log/wmf-aut... 
[08:48:41] (03PS5) 10Arturo Borrero Gonzalez: openstack: fullstack: use a readable-friendly name for VMs [puppet] - 10https://gerrit.wikimedia.org/r/525320 [08:49:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: fullstack: use a readable-friendly name for VMs [puppet] - 10https://gerrit.wikimedia.org/r/525320 (owner: 10Arturo Borrero Gonzalez) [08:49:50] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:50:21] (03PS2) 10Ema: tlsproxy: toggle dynamic ssl_buffer_size settings [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) [08:50:59] (03PS3) 10Ema: vcl: do not cache the beta variant of the mobile site [puppet] - 10https://gerrit.wikimedia.org/r/525268 (https://phabricator.wikimedia.org/T228861) [08:51:45] (03CR) 10Ema: [C: 03+2] vcl: do not cache the beta variant of the mobile site [puppet] - 10https://gerrit.wikimedia.org/r/525268 (https://phabricator.wikimedia.org/T228861) (owner: 10Ema) [08:54:07] !log rebooting cloudvirt2001-dev [08:54:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:40] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2128.codfw.wmnet'] ` and were **ALL** successful. 
[09:03:53] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2130.codfw.wmnet'] ` The log can be found in `/var/log/wmf-aut... [09:03:58] (03PS1) 10Ema: vcl: pass beta version of mobile site in vcl_recv [puppet] - 10https://gerrit.wikimedia.org/r/525500 (https://phabricator.wikimedia.org/T228861) [09:04:48] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) [09:07:08] (03PS1) 10Jbond: puppetmaster: pass canary_host via the profile class [puppet] - 10https://gerrit.wikimedia.org/r/525501 [09:07:52] (03PS1) 10Filippo Giunchedi: prometheus: aggregate puppet failure percent by cluster [puppet] - 10https://gerrit.wikimedia.org/r/525502 (https://phabricator.wikimedia.org/T228878) [09:09:21] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:09:59] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 7 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[load-new-vcl-file-frontend],Exec[load-new-vcl-file] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:12:40] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2129.codfw.wmnet'] ` and were **ALL** successful. 
[09:13:30] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) [09:14:32] 10Puppet: Passenger stderr warnings for regex and htpasswd.rb - https://phabricator.wikimedia.org/T228966 (10fgiunchedi) [09:14:39] 10Puppet: Passenger stderr warnings for regex and htpasswd.rb - https://phabricator.wikimedia.org/T228966 (10fgiunchedi) p:05Triage→03Low [09:14:52] (03PS2) 10Jbond: puppetmaster: pass canary_host via the profile class [puppet] - 10https://gerrit.wikimedia.org/r/525501 [09:16:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:16:15] (03PS1) 10Fsero: k8s: enable limitranges and resource quotas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525505 (https://phabricator.wikimedia.org/T228965) [09:18:24] (03PS2) 10Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) [09:18:48] (03PS2) 10Fsero: k8s: enable limitranges and resource quotas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525505 (https://phabricator.wikimedia.org/T228965) [09:19:39] (03PS1) 10Marostegui: Revert "wmnet: Failover dbproxy1001 to dbproxy1006" [dns] - 10https://gerrit.wikimedia.org/r/525506 [09:19:42] (03CR) 10Ema: [C: 03+2] vcl: pass beta version of mobile site in vcl_recv [puppet] - 10https://gerrit.wikimedia.org/r/525500 (https://phabricator.wikimedia.org/T228861) (owner: 10Ema) [09:19:53] (03PS2) 10Marostegui: Revert "wmnet: Failover dbproxy1001 to dbproxy1006" [dns] - 10https://gerrit.wikimedia.org/r/525506 [09:20:17] 10Operations, 10DBA, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Marostegui) 
[09:20:26] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Failover dbproxy1001 to dbproxy1006" [dns] - 10https://gerrit.wikimedia.org/r/525506 (owner: 10Marostegui) [09:21:25] !log Failover m1 from dbproxy1006 to dbproxy1001 - T227139 [09:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:33] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 [09:24:00] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[load-new-vcl-file-frontend],Exec[load-new-vcl-file] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:24:33] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10jijiki) @Eevans I was under the impression we have more work to be done on the server. Shall we mark this task as resolved? [09:24:34] PROBLEM - puppet last run on cp1085 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 3 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[load-new-vcl-file-frontend],Exec[load-new-vcl-file] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:25:07] fixing ^ [09:25:52] PROBLEM - puppet last run on cp1083 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[load-new-vcl-file-frontend],Exec[load-new-vcl-file] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:27:25] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2130.codfw.wmnet'] ` and were **ALL** successful. 
[09:27:53] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) [09:28:04] (03CR) 10Jbond: [C: 03+2] puppetmaster: pass canary_host via the profile class [puppet] - 10https://gerrit.wikimedia.org/r/525501 (owner: 10Jbond) [09:28:12] (03PS3) 10Jbond: puppetmaster: pass canary_host via the profile class [puppet] - 10https://gerrit.wikimedia.org/r/525501 [09:29:00] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) @Papaul all hosts but db2127 have been installed. I have managed to get into the BIOS of all of them and change the boot settings, however, db2127's idrac password... [09:29:16] 10Operations, 10serviceops, 10Core Platform Team Legacy (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [09:29:28] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:30:00] RECOVERY - puppet last run on cp1085 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:31:04] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:31:24] RECOVERY - puppet last run on cp1083 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:32:59] 10Operations, 10serviceops, 10Core Platform Team Legacy (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Use PHP7 to run all async jobs - 
https://phabricator.wikimedia.org/T219148 (10jijiki) All async jobs run on PHP7, we will keep an eye for about a week, and then c... [09:34:30] 10Operations, 10observability, 10serviceops, 10User-Elukey: Test memsniff as possible replacement of memkeys - https://phabricator.wikimedia.org/T228970 (10elukey) p:05Triage→03Normal [09:41:56] (03PS1) 10Jbond: puppetmaster: enable puppetmaster1003 as a canary host [puppet] - 10https://gerrit.wikimedia.org/r/525508 [09:42:07] (03PS6) 10Giuseppe Lavagetto: New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 [09:42:08] (03PS4) 10Giuseppe Lavagetto: Add debian package build [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517979 [09:43:02] (03CR) 10Elukey: [C: 03+1] profile::mediawiki::nutcracker: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/525229 (owner: 10Muehlenhoff) [09:45:35] (03CR) 10jerkins-bot: [V: 04-1] Add debian package build [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517979 (owner: 10Giuseppe Lavagetto) [09:45:44] (03CR) 10jerkins-bot: [V: 04-1] New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [09:47:09] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10elukey) 05Open→03Resolved a:03elukey [09:47:15] (03CR) 10Giuseppe Lavagetto: New library to interact with poolcounter from python (0310 comments) [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [09:47:22] (03CR) 10Jbond: [C: 03+2] puppetmaster: enable puppetmaster1003 as a canary host [puppet] - 10https://gerrit.wikimedia.org/r/525508 (owner: 10Jbond) [09:47:37] [09:50:18] (03PS7) 10Giuseppe Lavagetto: New library to interact 
with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 [09:50:20] (03PS5) 10Giuseppe Lavagetto: Add debian package build [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517979 [09:51:31] (03CR) 10Effie Mouzeli: [C: 03+1] "Let's do it!" [puppet] - 10https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [09:52:00] (03PS1) 10Jbond: Revert "puppetmaster: enable puppetmaster1003 as a canary host" [puppet] - 10https://gerrit.wikimedia.org/r/525510 [09:52:37] (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster: enable puppetmaster1003 as a canary host" [puppet] - 10https://gerrit.wikimedia.org/r/525510 (owner: 10Jbond) [09:53:59] (03CR) 10jerkins-bot: [V: 04-1] New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [09:54:28] (03CR) 10jerkins-bot: [V: 04-1] Add debian package build [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517979 (owner: 10Giuseppe Lavagetto) [09:58:27] (03PS3) 10Mforns: analytics::refinery::job::druid_load add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) [10:05:00] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17612/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: 10Mforns) [10:05:21] (03PS4) 10Elukey: analytics::refinery::job::druid_load: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: 10Mforns) [10:09:21] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:12:41] (03PS1) 10Filippo Giunchedi: monitoring: add logstash 5xx dashboard to availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878) [10:12:43] (03PS1) 10Filippo Giunchedi: prometheus: calculate nginx/varnish availability over 2m too [puppet] - 10https://gerrit.wikimedia.org/r/525512 (https://phabricator.wikimedia.org/T228878) [10:14:02] (03PS1) 10Marostegui: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) [10:14:21] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) (owner: 10Marostegui) [10:15:32] (03PS2) 10Marostegui: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) [10:15:37] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:16:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:27] nice --^ [10:17:31] maintenance should be ended [10:22:24] (03PS3) 10Ema: tlsproxy: toggle dynamic ssl_buffer_size settings [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) [10:26:19] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:29:24] (03CR) 10Volans: "hknust, urandom: I did a first pass, here below some general 
comment, all the details inline. I've marked the comments with [tags]." (0340 comments) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [10:35:03] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:35:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:16] !log rebooting cloudvirt1024 for kernel update [10:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:37] (03PS1) 10Elukey: cumin: add hadoop-worker-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525514 [10:37:15] (03CR) 10Muehlenhoff: [C: 03+1] cumin: add hadoop-worker-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525514 (owner: 10Elukey) [10:40:36] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524411 (https://phabricator.wikimedia.org/T227062) (owner: 10Marostegui) [10:40:58] (03CR) 10Elukey: [C: 03+2] cumin: add hadoop-worker-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525514 (owner: 10Elukey) [10:42:48] (03PS1) 10Jbond: puppetmaster: use mod rewrite for conditional proxypass [puppet] - 10https://gerrit.wikimedia.org/r/525516 [10:45:11] (03PS1) 10Elukey: cumin: add hadoop-journal-hdfs-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525517 [10:47:15] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) (owner: 10Marostegui) [10:52:24] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:52:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:08] !log rebooting cloudvirt2003-dev [10:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:37] (03PS2) 10Elukey: cumin: add hadoop-hdfs-journal-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525517 [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:03:19] !log update stretch-wikimedia/thirdparty/kubeadm-k8s on install1002 for T215531 (kubeadm 1.15.1) [11:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:26] T215531: Deploy upgraded Kubernetes to toolsbeta - https://phabricator.wikimedia.org/T215531 [11:17:11] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10MoritzMuehlenhoff) The failed install might be due to https://phabricator.wikimedia.org/T222960#5327461 ? 
[11:22:44] (03PS2) 10Alexandros Kosiaris: Update tests to using puppet 4.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/524525 [11:24:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] "INFO: [change] Nodes: 1 FAIL 402 NOOP 1 ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/524525 (owner: 10Alexandros Kosiaris) [11:35:31] (03PS1) 10Muehlenhoff: Move modprobe config to always enable L1 cache flushes to common Nova class [puppet] - 10https://gerrit.wikimedia.org/r/525519 (https://phabricator.wikimedia.org/T228870) [11:47:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: enable limitranges and resource quotas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525505 (https://phabricator.wikimedia.org/T228965) (owner: 10Fsero) [11:48:51] (03PS1) 10Elukey: Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 [11:53:04] !log Compress s3 wikis on labsdb1010 - T222978 [11:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:59] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [11:56:16] (03CR) 10jerkins-bot: [V: 04-1] Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (owner: 10Elukey) [11:56:51] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17614/" [puppet] - 10https://gerrit.wikimedia.org/r/525519 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff) [11:57:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] k8s: adding PodSecurityPolicies (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) (owner: 10Fsero) [12:09:42] (03PS2) 10Alexandros Kosiaris: Add accraze to team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/525143 (https://phabricator.wikimedia.org/T226417) (owner: 10Halfak) [12:09:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add accraze to team-scoring [puppet] 
- 10https://gerrit.wikimedia.org/r/525143 (https://phabricator.wikimedia.org/T226417) (owner: 10Halfak) [12:09:59] (03CR) 10Fsero: "if i change the comments, the rest is LGTM to you?" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) (owner: 10Fsero) [12:11:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:13:54] ^ is that expected? [12:16:17] (03PS3) 10Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) [12:16:29] (03PS1) 10Jbond: apache: add validate_cmd to apache config [puppet] - 10https://gerrit.wikimedia.org/r/525525 [12:31:28] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:36:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) (owner: 10Fsero) [12:38:09] (03PS1) 10Marostegui: mariadb: Decommission dbproxy1004 and dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/525527 (https://phabricator.wikimedia.org/T228768) [12:41:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:42:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: deploy user 
should be able to list any resource [deployment-charts] - 10https://gerrit.wikimedia.org/r/525240 (owner: 10Fsero) [12:45:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:45:42] (03CR) 10Filippo Giunchedi: "LGTM overall, is this going to reload nginx when deployed ?" [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) (owner: 10Ema) [12:48:20] 10Operations, 10media-storage, 10User-fgiunchedi: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10fgiunchedi) [12:49:16] !log Drop abuse_filter_log.afl_log_id in s4 codfw (lag will appear on codfw) - T226851 [12:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:32] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [12:50:36] (03PS1) 10Muehlenhoff: role::mediawiki::common: Remove support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/525532 [12:53:56] there doesn't seem to be any train blockers, or many new errors from wmf.15 in the past 24 hours, so I'm expecting to promote it to group2 soon, once the deployment window opens [12:59:43] (03PS2) 10Arturo Borrero Gonzalez: openstack: move modprobe config to always enable L1 cache flushes to base Nova class [puppet] - 10https://gerrit.wikimedia.org/r/525519 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff) [12:59:55] (03PS3) 10Arturo Borrero Gonzalez: openstack: move modprobe config to always enable L1 cache flushes to base Nova class [puppet] - 10https://gerrit.wikimedia.org/r/525519 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff) [13:00:04] liw: Dear deployers, time to do the MediaWiki train - European version deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1300). [13:01:05] (03CR) 10jerkins-bot: [V: 04-1] openstack: move modprobe config to always enable L1 cache flushes to base Nova class [puppet] - 10https://gerrit.wikimedia.org/r/525519 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff) [13:01:39] (03PS1) 10Lars Wirzenius: all wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525533 [13:01:41] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525533 (owner: 10Lars Wirzenius) [13:02:44] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525533 (owner: 10Lars Wirzenius) [13:02:49] (03PS1) 10Ema: ATS: w.wiki rewrite to meta [puppet] - 10https://gerrit.wikimedia.org/r/525534 (https://phabricator.wikimedia.org/T133485) [13:03:01] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525533 (owner: 10Lars Wirzenius) [13:03:11] (03PS4) 10Arturo Borrero Gonzalez: openstack: enable L1 cache flushes in base Nova class [puppet] - 10https://gerrit.wikimedia.org/r/525519 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff) [13:04:20] (03PS2) 10Ema: kibana: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/524529 (https://phabricator.wikimedia.org/T210411) [13:04:34] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.15 [13:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: enable L1 cache flushes in base Nova class [puppet] - 10https://gerrit.wikimedia.org/r/525519 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff) [13:05:17] (03PS3) 10Ema: kibana: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/524529 
(https://phabricator.wikimedia.org/T210411) [13:05:44] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:18] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:19] (03CR) 10Ema: [C: 03+2] kibana: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/524529 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:08:39] all sites at 1.34.0-wmf.15, now seeing if logs explode with error messages [13:09:20] !log Drop abuse_filter_log.afl_log_id in s5 eqiad - T226851 [13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:27] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [13:10:34] 10Operations, 10vm-requests, 10Patch-For-Review: eqiad: One VM request for identity provider - https://phabricator.wikimedia.org/T228403 (10akosiaris) 05Open→03Resolved idp1001.wikimedia.org has been installed and is up and running, I 'll resolve this [13:14:21] (03PS1) 10Filippo Giunchedi: WIP consolidate critical and contact groups logic [puppet] - 10https://gerrit.wikimedia.org/r/525535 [13:14:23] (03PS1) 10Filippo Giunchedi: WIP tweak description for paging alerts [puppet] - 10https://gerrit.wikimedia.org/r/525536 [13:19:03] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: deploy user should be able to list any resource [deployment-charts] - 10https://gerrit.wikimedia.org/r/525240 (owner: 10Fsero) [13:19:09] (03PS3) 10Fsero: k8s: deploy user should be able to list any resource [deployment-charts] - 10https://gerrit.wikimedia.org/r/525240 [13:19:15] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: deploy user should be able to list any resource [deployment-charts] - 10https://gerrit.wikimedia.org/r/525240 (owner: 10Fsero) [13:19:40] !log recreating clusterrole deploy from 
helmfile in staging [13:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:42] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! (preauth is protocol option which prevents a class of brute force attacks on tickets)" [puppet] - 10https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [13:28:22] (03PS1) 10Matthias Mullie: Enable other statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525538 [13:29:45] (03PS2) 10Ema: ATS: w.wiki rewrite to meta [puppet] - 10https://gerrit.wikimedia.org/r/525534 (https://phabricator.wikimedia.org/T133485) [13:29:55] (03PS2) 10Filippo Giunchedi: monitoring: add logstash 5xx dashboard to availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878) [13:29:57] (03PS2) 10Filippo Giunchedi: prometheus: calculate nginx/varnish availability over 2m too [puppet] - 10https://gerrit.wikimedia.org/r/525512 (https://phabricator.wikimedia.org/T228878) [13:29:59] (03PS1) 10Filippo Giunchedi: monitoring: allow logstash in dashboard_links [puppet] - 10https://gerrit.wikimedia.org/r/525539 [13:31:01] (03PS2) 10Filippo Giunchedi: WIP consolidate critical and contact groups logic [puppet] - 10https://gerrit.wikimedia.org/r/525535 [13:31:03] (03PS2) 10Filippo Giunchedi: WIP tweak description for paging alerts [puppet] - 10https://gerrit.wikimedia.org/r/525536 [13:31:16] sigh, I thought the WIP changes weren't going to spam [13:31:47] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add logstash 5xx dashboard to availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:31:52] (03CR) 10jerkins-bot: [V: 04-1] 
monitoring: allow logstash in dashboard_links [puppet] - 10https://gerrit.wikimedia.org/r/525539 (owner: 10Filippo Giunchedi) [13:33:13] (03CR) 10jerkins-bot: [V: 04-1] WIP tweak description for paging alerts [puppet] - 10https://gerrit.wikimedia.org/r/525536 (owner: 10Filippo Giunchedi) [13:35:24] !log cloudvirt1015 offline for ram swap via T220853 [13:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:32] T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 [13:36:01] (03PS1) 10Fsero: k8s: change deploy rolebinding to use view clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/525540 [13:36:26] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: change deploy rolebinding to use view clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/525540 (owner: 10Fsero) [13:36:33] (03PS2) 10Fsero: k8s: change deploy rolebinding to use view clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/525540 [13:36:39] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: change deploy rolebinding to use view clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/525540 (owner: 10Fsero) [13:38:38] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:05] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:29] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:39:30] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [13:39:30] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[13:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:38] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. This is fixed in the ferm package in buster-wikimedia and I plan to also backport/deploy 2.4.2 to stretch-wikimedia, but haven" [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [13:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:01] (03PS1) 10Jhedden: dumps dist: switch active web to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/525541 (https://phabricator.wikimedia.org/T224228) [13:41:13] (03CR) 10Volans: [C: 04-1] "Minor details inline" (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (owner: 10Elukey) [13:42:17] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:42:18] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [13:42:18] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[13:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:28] PROBLEM - Host cloudvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) (owner: 10Ema) [13:44:23] (03CR) 10Ema: [C: 03+2] ATS: w.wiki rewrite to meta [puppet] - 10https://gerrit.wikimedia.org/r/525534 (https://phabricator.wikimedia.org/T133485) (owner: 10Ema) [13:44:42] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops-radar, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) [13:45:48] (03CR) 10Elukey: [C: 03+1] "Can't speak for all the puppet code since I am not super familiar with it, but +1 for this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/525527 (https://phabricator.wikimedia.org/T228768) (owner: 10Marostegui) [13:46:17] (03PS1) 10Krinkle: Remove redundant wgResourceLoaderStorageEnabled override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525542 [13:46:25] (03PS2) 10Marostegui: mariadb: Decommission dbproxy1004 and dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/525527 (https://phabricator.wikimedia.org/T228768) [13:47:19] (03PS3) 10Elukey: profile::kerberos::kdcserver: add +requires_preauth to new users [puppet] - 10https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104) [13:47:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission dbproxy1004 and dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/525527 (https://phabricator.wikimedia.org/T228768) (owner: 10Marostegui) [13:48:12] RECOVERY - Host cloudvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [13:48:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) ` Record: 1 Date/Time: 07/24/2019 13:23:07 Source: system Severity: Ok Description: Log cleared. ----... [13:49:31] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:49:31] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [13:49:32] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[13:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:55] 10Operations, 10Analytics, 10Analytics-EventLogging, 10DBA, and 2 others: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Marostegui) [13:50:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) a:05Cmjohnson→03Andrew @Andrew: We've swapped out the failed memory dimm on this system and the new one hasn't... [13:52:28] !log installing Java security updates on AQS, Hadoop and Kafka/Jumbo servers [13:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:46] (03CR) 10Elukey: "Thanks for the review!" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (owner: 10Elukey) [13:54:35] (03PS3) 10Fsero: k8s: enable limitranges and resource quotas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525505 (https://phabricator.wikimedia.org/T228965) [13:54:56] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: enable limitranges and resource quotas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525505 (https://phabricator.wikimedia.org/T228965) (owner: 10Fsero) [13:55:31] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kdcserver: add +requires_preauth to new users [puppet] - 10https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [13:55:37] (03PS4) 10Elukey: profile::kerberos::kdcserver: add +requires_preauth to new users [puppet] - 10https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104) [13:56:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on 
cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10JHedden) Thanks @RobH. I'll spin up some stress testing VMs on that host and let them run until Andrew gets back from vaca... [13:59:00] PROBLEM - puppet last run on db1107 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:00:00] (03CR) 10Herron: "LGTM overall from a cursory look. Could you add some additional details/reasoning to the commit msg and any associated Bugs/tasks?" [puppet] - 10https://gerrit.wikimedia.org/r/525525 (owner: 10Jbond) [14:02:47] !log installing Java security updates on Druid servers [14:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:03] (03PS4) 10Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 [14:04:05] (03PS1) 10Fsero: k8s: bug: added missing pods key needed for helmfile templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/525546 [14:04:35] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: bug: added missing pods key needed for helmfile templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/525546 (owner: 10Fsero) [14:04:40] (03PS2) 10Fsero: k8s: bug: added missing pods key needed for helmfile templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/525546 [14:04:43] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: bug: added missing pods key needed for helmfile templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/525546 (owner: 10Fsero) [14:06:23] (03PS1) 10Elukey: profile::kerberos::kadminserver: add missing space in manage_princ.py [puppet] - 10https://gerrit.wikimedia.org/r/525547 [14:06:43] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[14:06:43] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [14:06:44] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:30] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: add missing space in manage_princ.py [puppet] - 10https://gerrit.wikimedia.org/r/525547 (owner: 10Elukey) [14:08:00] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:08:46] 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (10RobH) The replacement powersupply is now on site. The old power supply does NOT need to be returned to Dell. {F29869190} The powersupplies on dbproxy1012 show no error LEDs at this tim... [14:10:00] PROBLEM - puppet last run on db1108 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:47] (03PS1) 10Ema: ATS: do not cache Authorization responses [puppet] - 10https://gerrit.wikimedia.org/r/525548 (https://phabricator.wikimedia.org/T227432) [14:13:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Increase size of cirrus curl pools [puppet] - 10https://gerrit.wikimedia.org/r/525156 (owner: 10EBernhardson) [14:13:54] (03CR) 10Muehlenhoff: "It's my understanding that the wmf-mariadb packages are only meant for production and any setup in Cloud VPS should use the mariadb packag" [puppet] - 10https://gerrit.wikimedia.org/r/524721 (owner: 10Muehlenhoff) [14:14:28] 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (10RobH) 05Open→03Resolved did a racreset and the issue cleared. the spare psu is now a shelf spare [14:14:31] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [14:14:36] (03CR) 10Elukey: [C: 03+2] cumin: add hadoop-hdfs-journal-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525517 (owner: 10Elukey) [14:14:50] (03PS3) 10Elukey: cumin: add hadoop-hdfs-journal-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525517 [14:14:53] (03CR) 10Elukey: [V: 03+2 C: 03+2] cumin: add hadoop-hdfs-journal-test alias [puppet] - 10https://gerrit.wikimedia.org/r/525517 (owner: 10Elukey) [14:16:02] (03PS5) 10Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 [14:16:05] (03PS3) 10Elukey: profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) [14:16:43] (03PS3) 10BBlack: anycast recdns: use for all LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/520441 (https://phabricator.wikimedia.org/T228190) [14:16:45] (03PS1) 10BBlack: anycast recdns: use for eqsin LVS 
balancers [puppet] - 10https://gerrit.wikimedia.org/r/525549 (https://phabricator.wikimedia.org/T228190) [14:16:47] (03PS1) 10BBlack: anycast recdns: use for esams LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525550 (https://phabricator.wikimedia.org/T228190) [14:16:50] (03PS1) 10BBlack: anycast recdns: use for ulsfo LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525551 (https://phabricator.wikimedia.org/T228190) [14:16:54] (03PS1) 10BBlack: anycast recdns: use for codfw LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525552 (https://phabricator.wikimedia.org/T228190) [14:16:56] (03CR) 10Herron: [C: 03+1] prometheus: calculate nginx/varnish availability over 2m too [puppet] - 10https://gerrit.wikimedia.org/r/525512 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [14:17:51] (03CR) 10Herron: [C: 03+1] monitoring: add logstash 5xx dashboard to availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [14:17:55] (03PS6) 10Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) [14:17:58] 10Operations, 10ops-eqiad, 10DC-Ops: dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228891 (10RobH) 05Open→03Resolved in checking, nothing is showing as offline and icinga is happy. 
[14:18:00] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [14:18:19] (03PS7) 10Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) [14:19:31] (03PS3) 10Alexandros Kosiaris: Add accraze to team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/525143 (https://phabricator.wikimedia.org/T226417) (owner: 10Halfak) [14:20:34] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281 (https://phabricator.wikimedia.org/T228967) (owner: 10Fsero) [14:21:35] (03PS2) 10Elukey: Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 [14:22:27] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [14:22:28] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [14:22:29] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[14:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:20] (03CR) 10Elukey: [C: 03+2] profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [14:23:26] (03PS4) 10Elukey: profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) [14:24:27] (03PS2) 10BBlack: Anycast, make recdns VIP alerts page [puppet] - 10https://gerrit.wikimedia.org/r/524067 (https://phabricator.wikimedia.org/T228190) (owner: 10Ayounsi) [14:24:29] (03PS2) 10BBlack: anycast recdns: use for eqsin LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525549 (https://phabricator.wikimedia.org/T228190) [14:24:31] (03PS2) 10BBlack: anycast recdns: use for esams LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525550 (https://phabricator.wikimedia.org/T228190) [14:24:33] (03PS2) 10BBlack: anycast recdns: use for ulsfo LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525551 (https://phabricator.wikimedia.org/T228190) [14:24:35] (03PS2) 10BBlack: anycast recdns: use for codfw LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525552 (https://phabricator.wikimedia.org/T228190) [14:24:37] (03PS4) 10BBlack: anycast recdns: use for all LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/520441 (https://phabricator.wikimedia.org/T228190) [14:25:16] (03CR) 10jerkins-bot: [V: 04-1] Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (owner: 10Elukey) [14:26:18] (03CR) 10BBlack: [C: 03+2] Anycast, make recdns VIP alerts page [puppet] - 10https://gerrit.wikimedia.org/r/524067 (https://phabricator.wikimedia.org/T228190) (owner: 
10Ayounsi) [14:26:53] (03PS3) 10BBlack: Anycast, make recdns VIP alerts page [puppet] - 10https://gerrit.wikimedia.org/r/524067 (https://phabricator.wikimedia.org/T228190) (owner: 10Ayounsi) [14:26:55] (03PS3) 10BBlack: anycast recdns: use for eqsin LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525549 (https://phabricator.wikimedia.org/T228190) [14:26:57] (03PS3) 10BBlack: anycast recdns: use for esams LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525550 (https://phabricator.wikimedia.org/T228190) [14:26:59] (03PS3) 10BBlack: anycast recdns: use for ulsfo LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525551 (https://phabricator.wikimedia.org/T228190) [14:27:01] (03PS3) 10BBlack: anycast recdns: use for codfw LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525552 (https://phabricator.wikimedia.org/T228190) [14:27:03] (03PS5) 10BBlack: anycast recdns: use for all LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/520441 (https://phabricator.wikimedia.org/T228190) [14:27:38] (03PS1) 10Fsero: k8s: enabling PodSecurityPolicy admission controller in staging [puppet] - 10https://gerrit.wikimedia.org/r/525553 (https://phabricator.wikimedia.org/T228967) [14:28:06] (03CR) 10CDanis: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/525469 (owner: 10CDanis) [14:28:49] (03PS2) 10CDanis: conftool-data: add entries for mwconfig/dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/525469 [14:28:57] (03CR) 10CDanis: [C: 03+2] conftool-data: add entries for mwconfig/dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/525469 (owner: 10CDanis) [14:29:52] (03CR) 10Fsero: "PCC seems happy https://puppet-compiler.wmflabs.org/compiler1002/17618/neon.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/525553 (https://phabricator.wikimedia.org/T228967) (owner: 10Fsero) [14:30:30] (03PS3) 10CDanis: conftool-data: add entries for mwconfig/dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/525469 [14:30:40] 
(03CR) 10CDanis: [V: 03+2 C: 03+2] conftool-data: add entries for mwconfig/dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/525469 (owner: 10CDanis) [14:31:18] PROBLEM - puppet last run on authdns1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:31:35] (03PS3) 10Elukey: Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (https://phabricator.wikimedia.org/T229003) [14:33:01] the puppet last run alert above for authdns1001, the logs say: [14:33:02] Jul 25 14:08:02 authdns1001 puppet-agent[169231]: 503 Proxy Error [14:33:27] I assume other recent similar alerts are the same [14:35:11] elukey: I am going to take a look at the db1107 and db1108 puppet errors, probably related to the m4 decomm [14:36:16] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:36:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: enabling PodSecurityPolicy admission controller in staging [puppet] - 10https://gerrit.wikimedia.org/r/525553 (https://phabricator.wikimedia.org/T228967) (owner: 10Fsero) [14:36:32] (03CR) 10jerkins-bot: [V: 04-1] Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [14:36:46] RECOVERY - puppet last run on authdns1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:37:05] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (10RobH) So in the full physical 
disk output, the disk for slot 1 was missing: ` robh@cloudvirt1024:~$ sudo megacli -PDList -aALL Adapter #0 Enclosure Device ID: 32 Slo... [14:37:05] (03CR) 10Fsero: [C: 03+2] k8s: enabling PodSecurityPolicy admission controller in staging [puppet] - 10https://gerrit.wikimedia.org/r/525553 (https://phabricator.wikimedia.org/T228967) (owner: 10Fsero) [14:37:09] (03CR) 10Volans: "LGTM, 1 nit and one optional nit :)" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [14:37:13] (03PS2) 10Fsero: k8s: enabling PodSecurityPolicy admission controller in staging [puppet] - 10https://gerrit.wikimedia.org/r/525553 (https://phabricator.wikimedia.org/T228967) [14:38:06] elukey: Ah, I know what it is. I had doubts about deleting the grants file already or wait until we decom db1107 and db1108, so I am going to restore, it is not worth spending time cleaning this class if those hosts will go away [14:38:09] (03PS2) 10Ema: ATS: do not cache Authorization responses [puppet] - 10https://gerrit.wikimedia.org/r/525548 (https://phabricator.wikimedia.org/T227432) [14:38:11] (03PS1) 10Ema: ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) [14:38:21] marostegui: :( ack thanks [14:38:33] elukey: I could even place an empty file and that'd work even [14:38:50] But I will restore entirely the .sql file and then once we kill those two hosts, we can finally remove it [14:41:40] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/524297 (owner: 10CDanis) [14:42:24] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:42:37] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@87b25f2]: Convert oozie actions from hive to hive2 [14:42:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:44] 10Operations, 10ops-eqiad, 10DC-Ops: dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228891 (10Marostegui) Thanks! [14:42:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10JHedden) Created these VMs ` openstack server list --project testlabs --long -c ID -c Name -c Host| grep cv1015 | 30f17a94... [14:42:56] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@87b25f2]: Convert oozie actions from hive to hive2 (duration: 00m 19s) [14:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:05] marostegui: let's do what you think is quick, don't spend too much time on it [14:43:18] ebernhardson: <3 [14:43:40] (03CR) 10CDanis: [C: 03+2] syncer: verify no data == no removal [software/conftool] - 10https://gerrit.wikimedia.org/r/524297 (owner: 10CDanis) [14:43:52] 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (10Marostegui) That was fast! Thank you! 
[14:44:52] (03PS1) 10Marostegui: mariadb: Re-add production-m4.sql grants file [puppet] - 10https://gerrit.wikimedia.org/r/525555 (https://phabricator.wikimedia.org/T228768) [14:46:22] (03Merged) 10jenkins-bot: syncer: verify no data == no removal [software/conftool] - 10https://gerrit.wikimedia.org/r/524297 (owner: 10CDanis) [14:46:53] (03CR) 10Marostegui: [C: 03+2] mariadb: Re-add production-m4.sql grants file [puppet] - 10https://gerrit.wikimedia.org/r/525555 (https://phabricator.wikimedia.org/T228768) (owner: 10Marostegui) [14:48:08] (03PS1) 10BBlack: Anycast recdns monitoring: use raw IP [puppet] - 10https://gerrit.wikimedia.org/r/525556 (https://phabricator.wikimedia.org/T228190) [14:49:06] (03CR) 10jerkins-bot: [V: 04-1] Anycast recdns monitoring: use raw IP [puppet] - 10https://gerrit.wikimedia.org/r/525556 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [14:49:54] RECOVERY - puppet last run on db1108 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:49:54] RECOVERY - puppet last run on db1107 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:49:56] elukey: ^ :) [14:50:08] <3 [14:50:09] (03PS2) 10BBlack: Anycast recdns monitoring: use raw IP [puppet] - 10https://gerrit.wikimedia.org/r/525556 (https://phabricator.wikimedia.org/T228190) [14:50:56] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:52:42] (03CR) 10BBlack: "https://puppet-compiler.wmflabs.org/compiler1002/17620/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/525556 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [14:52:45] 
(03CR) 10BBlack: [C: 03+2] Anycast recdns monitoring: use raw IP [puppet] - 10https://gerrit.wikimedia.org/r/525556 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [14:52:47] (03PS4) 10Elukey: Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (https://phabricator.wikimedia.org/T229003) [14:52:53] (03CR) 10Cwhite: [C: 03+1] monitoring: add logstash 5xx dashboard to availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [14:53:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525557 [14:53:36] (03CR) 10Cwhite: [C: 03+1] prometheus: calculate nginx/varnish availability over 2m too [puppet] - 10https://gerrit.wikimedia.org/r/525512 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [14:53:55] (03PS1) 10Fsero: k8s, helmfile: 1 millicore is too low as a request [deployment-charts] - 10https://gerrit.wikimedia.org/r/525558 [14:54:34] 10Operations, 10Analytics, 10Analytics-EventLogging, 10DBA, and 2 others: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Marostegui) [14:55:22] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Marostegui) a:05Marostegui→03RobH These two hosts are ready for #dc-ops to decommission [14:56:57] (03CR) 10Volans: [C: 03+1] "> Patch Set 8:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [14:57:39] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add new dumpbackup.py script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [14:59:34] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 
10https://gerrit.wikimedia.org/r/525520 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [15:02:01] (03PS1) 10CDanis: install etcd-client on cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/525560 [15:05:59] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 2 others: Write incident report for jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10WDoranWMF) [15:06:17] (03CR) 10Thcipriani: [C: 03+1] "Seems reasonable for blubberoid" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525558 (owner: 10Fsero) [15:09:31] !log rebooting labstore1004.eqiad.wmnet for updates T224228 [15:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] !log installing patch security updates for jessie [15:15:39] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [15:15:39] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 340 bytes in 60.013 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:15:52] PROBLEM - toolschecker: showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/secondary_cluster_showmount - 340 bytes in 60.009 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:16:01] jeh: related to the reboot? 
[15:16:20] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/cron - 340 bytes in 60.005 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:16:43] yes, that's related to labstore1004 reboot [15:16:58] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' . [15:16:58] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . [15:17:02] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [15:17:08] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:17:32] jeh: expected, just not downtimed, or real outage? [15:17:48] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' . [15:18:15] volans: it's expected [15:18:26] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [15:18:42] I'll update the procedures to downtime the affected services [15:18:49] ack, thanks [15:20:02] jeh: I've put in a 2h downtime for you [15:20:14] thank you [15:20:32] !log change netflow timeout settings on cr1/2-eqiad - T226810 [15:20:34] although FYI it would not prevent the page on recovery ;) [15:21:00] is gerrit bot down? Pushed a change and did not see it here [15:21:18] actually yes cmjohnson1 it quit [15:21:44] might be related to the labstore reboot? [15:21:50] it runs on tools [15:21:54] Yes, looks like. [15:22:07] It should come back up on its own.
[15:23:28] RECOVERY - toolschecker: showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:24:16] !log jforrester@deploy1001 Synchronized wmf-config/extension-list: Add SecureLinkFixer and TheWikipediaLibrary to i18n build T200751 T132084 (duration: 00m 49s) [15:25:16] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 16.129 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:25:36] (03PS3) 10Ema: ATS: do not cache Authorization responses [puppet] - 10https://gerrit.wikimedia.org/r/525548 (https://phabricator.wikimedia.org/T227432) [15:25:38] (03PS2) 10Ema: ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) [15:25:40] (03CR) 10Tarrow: [C: 03+1] "seems like 10% cpu makes sense for a termbox pod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525558 (owner: 10Fsero) [15:25:42] yay wikibugs [15:25:42] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s, helmfile: 1 millicore is too low as a request [deployment-charts] - 10https://gerrit.wikimedia.org/r/525558 (owner: 10Fsero) [15:25:46] (03CR) 10jerkins-bot: [V: 04-1] ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [15:25:48] (03PS3) 10Ema: ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) [15:25:54] (03Abandoned) 10Herron: kafka-main100[1-5]: add forward/reverse ipv4 dns entries [dns] - 10https://gerrit.wikimedia.org/r/521882 (https://phabricator.wikimedia.org/T226274) (owner: 10Herron) [15:25:58] (03PS1) 10Cmjohnson: 
Updating production dns for dbproxy1020/1021 [dns] - 10https://gerrit.wikimedia.org/r/525564 (https://phabricator.wikimedia.org/T228618) [15:26:00] (03PS1) 10RobH: db1065 decom [puppet] - 10https://gerrit.wikimedia.org/r/525565 (https://phabricator.wikimedia.org/T227560) [15:26:02] (03PS1) 10BBlack: Anycast recdns mon: use raw IP, for real this time [puppet] - 10https://gerrit.wikimedia.org/r/525566 (https://phabricator.wikimedia.org/T228190) [15:26:03] Minor backlog. [15:26:07] (03PS2) 10Jforrester: extension-list: Add SecureLinkFixer and TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524251 (https://phabricator.wikimedia.org/T200751) [15:26:09] (03CR) 10Jforrester: [C: 03+2] extension-list: Add SecureLinkFixer and TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524251 (https://phabricator.wikimedia.org/T200751) (owner: 10Jforrester) [15:26:11] (03PS1) 10RobH: decom db1065 prod dns [dns] - 10https://gerrit.wikimedia.org/r/525567 (https://phabricator.wikimedia.org/T227560) [15:26:13] (03CR) 10RobH: [C: 03+2] decom db1065 prod dns [dns] - 10https://gerrit.wikimedia.org/r/525567 (https://phabricator.wikimedia.org/T227560) (owner: 10RobH) [15:26:15] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::druid_load: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: 10Mforns) [15:26:17] (03PS5) 10Elukey: analytics::refinery::job::druid_load: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: 10Mforns) [15:26:19] (03CR) 10RobH: [C: 03+2] db1065 decom [puppet] - 10https://gerrit.wikimedia.org/r/525565 (https://phabricator.wikimedia.org/T227560) (owner: 10RobH) [15:26:21] (03CR) 10Cmjohnson: [C: 03+2] Updating production dns for dbproxy1020/1021 [dns] - 10https://gerrit.wikimedia.org/r/525564 (https://phabricator.wikimedia.org/T228618) (owner: 10Cmjohnson) [15:26:23] (03Merged) 
10jenkins-bot: extension-list: Add SecureLinkFixer and TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524251 (https://phabricator.wikimedia.org/T200751) (owner: 10Jforrester) [15:26:25] (03PS2) 10Cmjohnson: Updating production dns for dbproxy1020/1021 [dns] - 10https://gerrit.wikimedia.org/r/525564 (https://phabricator.wikimedia.org/T228618) [15:26:27] (03PS4) 10Ema: ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) [15:26:29] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Updating production dns for dbproxy1020/1021 [dns] - 10https://gerrit.wikimedia.org/r/525564 (https://phabricator.wikimedia.org/T228618) (owner: 10Cmjohnson) [15:26:33] (03CR) 10jenkins-bot: extension-list: Add SecureLinkFixer and TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524251 (https://phabricator.wikimedia.org/T200751) (owner: 10Jforrester) [15:26:35] (03PS6) 10Elukey: analytics::refinery::job::druid_load: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: 10Mforns) [15:26:37] (03CR) 10Ayounsi: [C: 03+2] Routinator set refresh to 10min (instead of 1h) [puppet] - 10https://gerrit.wikimedia.org/r/525204 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [15:26:39] (03PS2) 10Ayounsi: Routinator set refresh to 10min (instead of 1h) [puppet] - 10https://gerrit.wikimedia.org/r/525204 (https://phabricator.wikimedia.org/T220669) [15:26:43] (03PS1) 10Fsero: k8s,helmfile: bump resource limits for eventgate in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525568 [15:26:45] (03PS1) 10Bstorm: toolforge: Update the version string to match our software [puppet] - 10https://gerrit.wikimedia.org/r/525569 (https://phabricator.wikimedia.org/T215531) [15:26:47] (03Abandoned) 10Jforrester: Add SecureLinkFixer to extension-list [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/525346 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [15:26:49] (03PS2) 10Jforrester: Enable SecureLinkFixer everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [15:26:53] (03CR) 10BBlack: "https://puppet-compiler.wmflabs.org/compiler1002/17621/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/525566 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [15:26:55] (03CR) 10Jforrester: "Looks good to go next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [15:26:56] wikibugs remembers both your failures and your successes and reminds you of both at the same time <3 [15:26:57] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s,helmfile: bump resource limits for eventgate in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525568 (owner: 10Fsero) [15:27:03] (03CR) 10BBlack: [C: 03+2] Anycast recdns mon: use raw IP, for real this time [puppet] - 10https://gerrit.wikimedia.org/r/525566 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [15:27:05] (03PS3) 10Ayounsi: Routinator set refresh to 10min (instead of 1h) [puppet] - 10https://gerrit.wikimedia.org/r/525204 (https://phabricator.wikimedia.org/T220669) [15:27:07] (03PS2) 10BBlack: Anycast recdns mon: use raw IP, for real this time [puppet] - 10https://gerrit.wikimedia.org/r/525566 (https://phabricator.wikimedia.org/T228190) [15:27:09] (03CR) 10Jforrester: [C: 03+2] Remove redundant wgResourceLoaderStorageEnabled override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525542 (owner: 10Krinkle) [15:27:18] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:27:21] !log 
nuria@deploy1001 Started deploy [analytics/refinery@f310917]: deploying refinery - migrations to hive2 actions [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:45] (03Merged) 10jenkins-bot: Remove redundant wgResourceLoaderStorageEnabled override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525542 (owner: 10Krinkle) [15:27:54] lack of wikibugs -> rebase race hell! [15:28:01] (03CR) 10jenkins-bot: Remove redundant wgResourceLoaderStorageEnabled override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525542 (owner: 10Krinkle) [15:28:06] (03PS3) 10BBlack: Anycast recdns mon: use raw IP, for real this time [puppet] - 10https://gerrit.wikimedia.org/r/525566 (https://phabricator.wikimedia.org/T228190) [15:28:12] (03CR) 10BBlack: [V: 03+2 C: 03+2] Anycast recdns mon: use raw IP, for real this time [puppet] - 10https://gerrit.wikimedia.org/r/525566 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [15:29:08] (03CR) 10Elukey: [C: 03+2] Add sre.hadoop.rolling-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/525520 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [15:30:03] seems like there's some lack of race-prevention on puppet-merge iterating the masters, or something like that [15:30:13] ERROR: puppet-merge on puppetmaster2001.codfw.wmnet failed [15:30:21] (03PS5) 10Jforrester: Order list of extensions by alphabet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188 (owner: 10Aklapper) [15:30:25] and so-on, I think because of two closely-timed puppet-merges, maybe [15:30:35] (03Abandoned) 10Jforrester: Order list of extensions by alphabet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188 (owner: 10Aklapper) [15:32:11] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10RobH) [15:32:35] 10Operations, 10ops-eqiad: Degraded RAID on 
cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (10Cmjohnson) 05Open→03Resolved Replaced the disk in slot 1 it is still in Firmware state: Copyback I am resolving this task but if you find there is an issue with the disk please re-open the task and reas... [15:32:57] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove redundant wgResourceLoaderStorageEnabled override (duration: 00m 50s) [15:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:05] jouncebot: next [15:34:05] In 0 hour(s) and 25 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1600) [15:34:08] jouncebot: now [15:34:08] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [15:35:10] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [15:35:16] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:17] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [15:35:22] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `dbproxy1004.eqiad.wmnet` - dbproxy1004.eqi... [15:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:23] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:35:27] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `dbproxy1009.eqiad.wmnet` - dbproxy1009.eqi... 
[15:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Cmjohnson) 05Open→03Resolved This has been completed, resolving this task - Removed both servers from Row D - Placed DBP1020 in C5 and DBP1... [15:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:11] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Cmjohnson) Removing the ops-eqiad and DC-Ops tag, if a hardware issue presents itself pleas... [15:37:52] (03PS1) 10RobH: decom dbproxy100[49] [puppet] - 10https://gerrit.wikimedia.org/r/525571 (https://phabricator.wikimedia.org/T228768) [15:38:14] (03PS2) 10CDanis: install etcd-client on cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/525560 [15:38:29] (03CR) 10CDanis: [C: 03+2] install etcd-client on cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/525560 (owner: 10CDanis) [15:39:07] (03CR) 10RobH: [C: 03+2] decom dbproxy100[49] [puppet] - 10https://gerrit.wikimedia.org/r/525571 (https://phabricator.wikimedia.org/T228768) (owner: 10RobH) [15:39:18] (03PS2) 10RobH: decom dbproxy100[49] [puppet] - 10https://gerrit.wikimedia.org/r/525571 (https://phabricator.wikimedia.org/T228768) [15:41:01] !log nuria@deploy1001 Finished deploy [analytics/refinery@f310917]: deploying refinery - migrations to hive2 actions (duration: 13m 40s) [15:41:05] !log rebooting cloudstore1009.wikimedia.org for updates T224228 [15:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:19] (03PS1) 10RobH: decom dbproxy100[49] [dns] - 
10https://gerrit.wikimedia.org/r/525572 (https://phabricator.wikimedia.org/T228768) [15:41:21] (03PS3) 10RobH: decom dbproxy100[49] [puppet] - 10https://gerrit.wikimedia.org/r/525571 (https://phabricator.wikimedia.org/T228768) [15:41:43] (03CR) 10RobH: [C: 03+2] decom dbproxy100[49] [dns] - 10https://gerrit.wikimedia.org/r/525572 (https://phabricator.wikimedia.org/T228768) (owner: 10RobH) [15:42:56] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/525539 (owner: 10Filippo Giunchedi) [15:43:49] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10RobH) a:05RobH→03None [15:44:01] (03CR) 10jerkins-bot: [V: 04-1] monitoring: allow logstash in dashboard_links [puppet] - 10https://gerrit.wikimedia.org/r/525539 (owner: 10Filippo Giunchedi) [15:44:12] (03PS1) 10Elukey: sre.hadoop.rolling-restart-workers: fix bugs [cookbooks] - 10https://gerrit.wikimedia.org/r/525573 (https://phabricator.wikimedia.org/T229003) [15:45:08] (03CR) 10Aklapper: "Indeed, superseded by https://phabricator.wikimedia.org/rOMWC76a90871da55ba10990d15e9846794b422c2b7a4 (thanks!)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188 (owner: 10Aklapper) [15:48:13] (03CR) 10Elukey: [C: 03+2] sre.hadoop.rolling-restart-workers: fix bugs [cookbooks] - 10https://gerrit.wikimedia.org/r/525573 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [15:50:16] (03PS2) 10Bstorm: toolforge: Update the version string to match our software [puppet] - 10https://gerrit.wikimedia.org/r/525569 (https://phabricator.wikimedia.org/T215531) [15:51:18] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10Nuria) +1 to access. Please read thoroughly: https://office.wikimedia.org/wiki/Data_access_guidelines what is the data @DLynch is interested in?
[15:51:55] (03CR) 10Bstorm: [C: 03+2] toolforge: Update the version string to match our software [puppet] - 10https://gerrit.wikimedia.org/r/525569 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:52:07] !log rebooting cloudstore1008.wikimedia.org for updates T224228 [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:10] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/17622/" [puppet] - 10https://gerrit.wikimedia.org/r/525470 (owner: 10Ayounsi) [15:54:32] (03CR) 10Ayounsi: [C: 03+2] Anycast, make bird::bind_service more generic [puppet] - 10https://gerrit.wikimedia.org/r/525470 (owner: 10Ayounsi) [15:54:42] (03PS2) 10Ayounsi: Anycast, make bird::bind_service more generic [puppet] - 10https://gerrit.wikimedia.org/r/525470 [15:55:06] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10Nuria) Also, does @DLynch have shell access set up to [production, see my question on: https://phabricator.wikimedia.org/T224029 [15:58:43] (03PS1) 10Elukey: sre.hadoop.rolling-restart-workers.py: fix argparse defaults [cookbooks] - 10https://gerrit.wikimedia.org/r/525575 (https://phabricator.wikimedia.org/T229003) [16:00:04] _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:43] (03PS1) 10CDanis: Revert "Add accraze to team-scoring" [puppet] - 10https://gerrit.wikimedia.org/r/525576 [16:02:30] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add accraze to team-scoring" [puppet] - 10https://gerrit.wikimedia.org/r/525576 (owner: 10CDanis) [16:03:06] (03PS2) 10CDanis: Revert "Add accraze to team-scoring" [puppet] - 10https://gerrit.wikimedia.org/r/525576 [16:03:17] (03CR) 10Elukey: [C: 03+2] sre.hadoop.rolling-restart-workers.py: fix argparse defaults [cookbooks] - 10https://gerrit.wikimedia.org/r/525575 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [16:03:42] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10DLynch) I'm interested mainly in analytics data logged in relation to Editing products (the EditAttemptStep and VisualEditorFeatureUse schemas). I do not currently ha... [16:07:04] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 59.55, 38.12, 24.78 https://wikitech.wikimedia.org/wiki/Application_servers [16:07:08] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17623/" [puppet] - 10https://gerrit.wikimedia.org/r/525303 (owner: 10Ayounsi) [16:07:34] (03PS2) 10Jbond: apache: add validate_cmd to apache config [puppet] - 10https://gerrit.wikimedia.org/r/525525 [16:08:29] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10Nuria) @DLynch Ok, you need to get shell access to prod before this access is granted, let us know when that is done. 
[16:09:02] 10Operations, 10procurement: Request to Order Drive Replacement on elastic1046 - https://phabricator.wikimedia.org/T229017 (10wiki_willy) [16:10:15] (03CR) 1020after4: [C: 03+1] contint: remove arcanist [puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [16:10:28] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10wiki_willy) a:05wiki_willy→03Cmjohnson Thanks @elukey, subtask #T229017 has been opened to order the replacement drive with procurement. Assigning this task back to @Cmjohnson, for when the disk arrives o... [16:10:47] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Security-Team: [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10MSantos) 05Open→03Resolved a:03MSantos So, the current solution is to keep using Debian stable packages and stay vigilant... [16:11:47] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [16:15:24] (03CR) 10Effie Mouzeli: [C: 03+1] "This is great!" [puppet] - 10https://gerrit.wikimedia.org/r/525525 (owner: 10Jbond) [16:16:07] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10DLynch) So, as far as I know, we're currently talking on a ticket which was created through all the "new users" steps stated on the request-shell-access document, and... [16:17:08] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "LGTM although this will have no effect on jobrunners since we have stopped using HHVM there https://puppet-compiler.wmflabs.org/compiler1" [puppet] - 10https://gerrit.wikimedia.org/r/525156 (owner: 10EBernhardson) [16:17:36] (03CR) 10Hashar: [C: 03+1] "I have purged the jenkins instances." 
[puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [16:18:39] (03CR) 10EBernhardson: "no job runners is fine as only hhvm reports this metric. I also reviewed the nginx configuration for the jobrunner proxies and I don't see" [puppet] - 10https://gerrit.wikimedia.org/r/525156 (owner: 10EBernhardson) [16:19:22] !log Disable puppet on mw* servers for 525156 [16:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:29] 10Operations, 10SRE-Access-Requests, 10VisualEditor (Current work): Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10JTannerWMF) [16:19:43] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 16.79, 21.79, 23.80 https://wikitech.wikimedia.org/wiki/Application_servers [16:19:56] 10Operations, 10SRE-Access-Requests, 10VisualEditor (Current work): Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10JTannerWMF) I am putting this on the Visual Editor board to keep an eye on progress. [16:20:32] (03PS2) 10Elukey: contint: remove arcanist [puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [16:21:24] (03PS1) 10Herron: add conniecc1 to analytics-(wmde|privatedata)-users,researchers group [puppet] - 10https://gerrit.wikimedia.org/r/525578 (https://phabricator.wikimedia.org/T228447) [16:21:37] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1020.eqiad.wmnet'] ` The log can be found in `/v... 
[16:21:45] (03CR) 10Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [16:21:51] (03CR) 10Elukey: [C: 03+2] contint: remove arcanist [puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [16:22:23] (03CR) 10jerkins-bot: [V: 04-1] add conniecc1 to analytics-(wmde|privatedata)-users,researchers group [puppet] - 10https://gerrit.wikimedia.org/r/525578 (https://phabricator.wikimedia.org/T228447) (owner: 10Herron) [16:23:34] (03PS2) 10Herron: add conniecc1 to analytics-(wmde|privatedata)-users,researchers group [puppet] - 10https://gerrit.wikimedia.org/r/525578 (https://phabricator.wikimedia.org/T228447) [16:24:23] (03Abandoned) 10CDanis: Revert "Add accraze to team-scoring" [puppet] - 10https://gerrit.wikimedia.org/r/525576 (owner: 10CDanis) [16:28:17] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 49.16, 23.78, 17.26 https://wikitech.wikimedia.org/wiki/Application_servers [16:29:45] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 21.89, 21.38, 17.02 https://wikitech.wikimedia.org/wiki/Application_servers [16:33:11] 10Operations, 10Puppet, 10Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10jbond) In case this needs rolling back the the issue can be fixed in 4.8 with the following patch https://phabricator.wikimedia.org/P8772 [16:34:42] (03PS1) 10Zoranzoki21: Enable VisualEditor in namespace Wikipédia on Slovak Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525580 (https://phabricator.wikimedia.org/T229014) [16:34:52] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] Increase size of cirrus curl pools [puppet] - 10https://gerrit.wikimedia.org/r/525156 (owner: 10EBernhardson) [16:35:02] 
(03PS2) 10Effie Mouzeli: Increase size of cirrus curl pools [puppet] - 10https://gerrit.wikimedia.org/r/525156 (owner: 10EBernhardson) [16:35:55] (03PS2) 10Zoranzoki21: Enable VisualEditor in namespace Wikipédia on Slovak Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525580 (https://phabricator.wikimedia.org/T229014) [16:36:03] (03PS4) 10BBlack: anycast recdns: use for eqsin LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525549 (https://phabricator.wikimedia.org/T228190) [16:36:56] (03CR) 10Jbond: [C: 04-1] "Thanks for the reviews everyone however I'm going to -1 this for now. I'm pretty sure the way puppet validat_cmd is meant to work, is to " [puppet] - 10https://gerrit.wikimedia.org/r/525525 (owner: 10Jbond) [16:37:29] 10Operations, 10Beta-Cluster-Infrastructure, 10Mail, 10MediaWiki-Email, and 2 others: [betacluster] Cannot confirm email address - confirmation never received - https://phabricator.wikimedia.org/T227714 (10JTannerWMF) I am adding this to our board for visibility [16:38:36] (03PS3) 10Ayounsi: Anycast move bird::neighbors_list from role/site for all sites [puppet] - 10https://gerrit.wikimedia.org/r/525303 [16:40:22] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1020.eqiad.wmnet'] ` and were **ALL** successful. 
[16:40:31] (03CR) 10BBlack: [C: 03+2] anycast recdns: use for eqsin LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525549 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [16:40:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10herron) Looping in @nuria for analytics review/approval [16:40:36] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1021.eqiad.wmnet'] ` The log can be found in `/v... [16:40:43] (03PS5) 10BBlack: anycast recdns: use for eqsin LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525549 (https://phabricator.wikimedia.org/T228190) [16:41:08] (03CR) 10BBlack: [V: 03+2 C: 03+2] anycast recdns: use for eqsin LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/525549 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [16:41:54] (03PS4) 10Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) [16:41:56] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow installing php7 only [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) [16:43:08] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [16:43:12] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: allow installing php7 only [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) 
[16:44:20] !log lvs5003 - restart pybal for resolv.conf change - T228190 [16:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:29] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 [16:45:44] (03PS7) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [16:48:57] (03PS1) 10Ayounsi: FNM: Tune average_calculation_time [puppet] - 10https://gerrit.wikimedia.org/r/525590 (https://phabricator.wikimedia.org/T226810) [16:49:55] (03CR) 10Ayounsi: [C: 03+2] FNM: Tune average_calculation_time [puppet] - 10https://gerrit.wikimedia.org/r/525590 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [16:50:54] !log lvs5002 - restart pybal for resolv.conf change - T228190 [16:50:56] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 55.68, 34.80, 27.42 https://wikitech.wikimedia.org/wiki/Application_servers [16:50:59] (03PS2) 10Ayounsi: FNM: Tune average_calculation_time [puppet] - 10https://gerrit.wikimedia.org/r/525590 (https://phabricator.wikimedia.org/T226810) [16:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:01] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 [16:52:09] (03CR) 10Ayounsi: [C: 03+2] FNM: Tune average_calculation_time [puppet] - 10https://gerrit.wikimedia.org/r/525590 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [16:52:22] !log Rolling restart of hhvm across the fleet [16:52:23] (03CR) 10Elukey: [C: 03+2] Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [16:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:29] (03PS8) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics 
cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [16:53:45] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/WikibaseMediaInfo/resources/statements/: T228807 Fix formatValue abort handling (duration: 00m 48s) [16:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:53] T228807: Existing qualifiers do not appear in read-mode on the file page - https://phabricator.wikimedia.org/T228807 [16:54:05] !log lvs5001 - restart pybal for resolv.conf change - T228190 [16:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:56] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 48.22, 28.42, 22.90 https://wikitech.wikimedia.org/wiki/Application_servers [16:58:15] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1021.eqiad.wmnet'] ` and were **ALL** successful. [17:00:00] jouncebot next [17:00:00] In 0 hour(s) and 59 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1800) [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1700). 
[17:00:07] jouncebot now [17:00:07] For the next 0 hour(s) and 59 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1700) [17:01:46] !log elukey@cumin1001 START - Cookbook sre.hadoop.rolling-restart-workers [17:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:36] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Marostegui) Both hosts re-imaged. Thanks! [17:08:35] (03PS5) 10Volans: Update WDQS standard settings to new DB settings [puppet] - 10https://gerrit.wikimedia.org/r/524954 (https://phabricator.wikimedia.org/T228122) (owner: 10Smalyshev) [17:08:55] (03CR) 10Cwhite: [C: 03+1] "Looks good to me. Even if we do not choose a prometheus-based alerting solution I still see the value of globally aggregating this data." [puppet] - 10https://gerrit.wikimedia.org/r/525502 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [17:09:05] 10Operations, 10SRE-Access-Requests, 10VisualEditor (Current work): Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10Nuria) @DLynch Sorry the process is not more clear, requesting production access is a pre-requirement to get access to data, tickets f... 
[17:09:29] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:09:40] (03CR) 10Cwhite: [C: 03+2] hiera: deploy varnishkafka exporter to codfw [puppet] - 10https://gerrit.wikimedia.org/r/524932 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:09:44] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:09:50] (03PS2) 10Cwhite: hiera: deploy varnishkafka exporter to codfw [puppet] - 10https://gerrit.wikimedia.org/r/524932 (https://phabricator.wikimedia.org/T196066) [17:12:46] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:15:54] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:16:06] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 15.40, 18.08, 23.34 https://wikitech.wikimedia.org/wiki/Application_servers [17:17:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.rolling-restart-workers (exit_code=0) [17:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:09] 10Operations, 10SRE-Access-Requests: Requesting production shell access for DLynch - https://phabricator.wikimedia.org/T229028 (10DLynch) [17:18:15] !log disabled puppet on A:wdqs-all, deploying 
gerrit/524954 - T228122 [17:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:22] T228122: DB reload for WDQS - https://phabricator.wikimedia.org/T228122 [17:18:35] (03PS6) 10Volans: Update WDQS standard settings to new DB settings [puppet] - 10https://gerrit.wikimedia.org/r/524954 (https://phabricator.wikimedia.org/T228122) (owner: 10Smalyshev) [17:19:31] (03CR) 10Volans: [C: 03+2] Update WDQS standard settings to new DB settings [puppet] - 10https://gerrit.wikimedia.org/r/524954 (https://phabricator.wikimedia.org/T228122) (owner: 10Smalyshev) [17:25:13] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:26:03] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:26:28] (03CR) 10Cwhite: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi) [17:26:50] 10Operations, 10SRE-Access-Requests: Requesting production shell access for DLynch - https://phabricator.wikimedia.org/T229028 (10marcella) I am David's manager, and I approve the business need for this access. Thank you! 
[17:31:05] (03PS1) 10Herron: admin: create kemayo shell acct, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/525601 (https://phabricator.wikimedia.org/T227200) [17:32:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10VisualEditor (Current work): Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10Nuria) >So, as far as I know, we're currently talking on a ticket which was created through all the "new users"... [17:33:46] !log running sudo cumin -s30 -b1 -m async 'A:wdqs-internal' 'run-puppet-agent -e "volans - T228122 - deploying gerrit/524954"' 'systemctl restart wdqs-blazegraph' [17:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:54] T228122: DB reload for WDQS - https://phabricator.wikimedia.org/T228122 [17:41:27] (03PS3) 10Jeena Huneidi: Package mediawiki-dev and add to index [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 [17:41:47] RECOVERY - Memory correctable errors -EDAC- on wtp2013 is OK: (C)4 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [17:41:51] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 17.28, 18.84, 23.45 https://wikitech.wikimedia.org/wiki/Application_servers [17:42:37] (03PS3) 10Cwhite: profile: cleanup per-site varnishkafka deploy flags [puppet] - 10https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) [17:42:57] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:43:05] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:43:11] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [17:43:17] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:43:25] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:31] 10Operations, 10SRE-Access-Requests: Requesting production shell access for DLynch - https://phabricator.wikimedia.org/T229028 (10herron) 05Open→03Resolved a:03herron Hey David, sorry for the confusion. T227200 should be sufficient for both shell access and group membership. So I'll merge these tasks,... 
[17:43:31] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:43:39] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:43:53] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:44:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10VisualEditor (Current work): Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10herron) [17:44:04] 10Operations, 10SRE-Access-Requests: Requesting production shell access for DLynch - https://phabricator.wikimedia.org/T229028 (10herron) [17:44:13] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:44:18] !log sudo cumin -s30 -b1 -m async 'A:wdqs-all and not A:wdqs-internal and not P{wdqs1009.eqiad.wmnet}' 'run-puppet-agent -e "volans - T228122 - deploying gerrit/524954"' 'systemctl restart wdqs-blazegraph' [17:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:35] T228122: DB reload for WDQS - https://phabricator.wikimedia.org/T228122 [17:44:43] ottomata, elukey: anything ongoing on stat1007? [17:44:58] (03PS2) 10Herron: admin: create kemayo shell acct, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/525601 (https://phabricator.wikimedia.org/T227200) [17:45:22] volans: hm, not that i know of! 
[17:45:24] looking [17:45:47] I think it is a user causing a lot of load [17:45:50] I'm deploying something on wdqs, but shouldn't be related AFAIK (but I don't know much in that realm) [17:45:59] for conntrack? [17:46:19] icinga-wm: well all alarms fired :D [17:46:22] err [17:46:24] ottomata: [17:46:27] it is set to 120! [17:46:37] https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:47:05] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:47:09] elukey@stat1007:~$ top [17:47:11] -bash: fork: Cannot allocate memory [17:47:13] hm [17:47:14] ok [17:47:23] oh yeah getting borked for me too [17:47:29] didn't see much cpu going on, [17:47:43] (03CR) 10Herron: [C: 03+2] admin: create kemayo shell acct, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/525601 (https://phabricator.wikimedia.org/T227200) (owner: 10Herron) [17:47:43] the oom killer is probably going to do its job soon [17:48:08] killing your shell ofc :D [17:48:33] wouldn't be the first time I see the oom_killer killing sshd :-P [17:49:07] oh i see just conntrack check refused connection [17:49:28] connecting via com2 [17:50:01] PROBLEM - SSH on stat1007 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:50:35] cannot even log in as root via serial [17:50:53] ottomata: ok if I powercycle?
[17:51:12] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10RobH) [17:51:59] !log powercycle stat1007 [17:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:57] PROBLEM - Host stat1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:39] !log mbsantos@deploy1001 Started deploy [mobileapps/deploy@11d9d4a]: Update service-mobileapp-node to 200a323 (T228938 T228287) [17:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:53] T228287: [BUG] mobile-html: some pages are sized incorrectly on initial load - https://phabricator.wikimedia.org/T228287 [17:53:53] T228938: Mobile-html sometimes doesn't have etag - https://phabricator.wikimedia.org/T228938 [17:56:09] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:56:11] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:56:15] RECOVERY - Host stat1007 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:56:15] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [17:56:21] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:56:31] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:37] RECOVERY - SSH on stat1007 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring 
[17:56:45] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:56:57] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:57:19] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:57:45] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:58:18] !log mbsantos@deploy1001 Finished deploy [mobileapps/deploy@11d9d4a]: Update service-mobileapp-node to 200a323 (T228938 T228287) (duration: 04m 39s) [17:58:37] bearND: mdholloway ^ [17:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:56] (03PS1) 10Ottomata: Allow eventgate-analytics to get schemas from remote schema.svc if not present locally [deployment-charts] - 10https://gerrit.wikimedia.org/r/525609 (https://phabricator.wikimedia.org/T206789) [17:59:00] mateusbs17: all well? 
[17:59:10] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10RobH) [17:59:22] mdholloway: Yep, just a heads up that its finished [17:59:27] sweet [17:59:43] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [17:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:55] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [17:59:56] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts:... [18:00:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T1800). [18:00:04] Zoranzoki21 and Daimona: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:05] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts:... 
[18:00:06] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:11] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:16] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts:... [18:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:23] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:00:46] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10RobH) [18:00:51] o/ [18:03:53] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (10herron) 05Open→03Resolved >>! In T227496#5363988, @MoritzMuehlenhoff wrote: > staff members need to be a member of cn=wmf, cn=... [18:04:11] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:04:13] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:04:19] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:04:21] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:21] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:04:27] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts:... [18:04:28] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:04:28] (03CR) 10Krinkle: "This seems to make docker-pkg no longer work out of the box on macOS." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417 (owner: 10Giuseppe Lavagetto) [18:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:04:35] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts:... 
[18:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:41] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts:... [18:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:22] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10RobH) [18:07:29] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:07:39] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:07:56] (03PS1) 10RobH: decom kafka10(1[234]|2[023]).eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/525613 (https://phabricator.wikimedia.org/T226517) [18:09:00] Someone around for SWAT? 
[18:09:34] (03PS1) 10RobH: decom old kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/525615 (https://phabricator.wikimedia.org/T226517) [18:09:47] (03CR) 10RobH: [C: 03+2] decom kafka10(1[234]|2[023]).eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/525613 (https://phabricator.wikimedia.org/T226517) (owner: 10RobH) [18:10:00] (03CR) 10RobH: [C: 03+2] decom old kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/525615 (https://phabricator.wikimedia.org/T226517) (owner: 10RobH) [18:11:58] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10RobH) [18:12:24] (03CR) 10Krinkle: "Workaround: run `export REQUESTS_CA_BUNDLE=;` before using docker-pkg, which fools this check and seems to also still be safely ignored by" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417 (owner: 10Giuseppe Lavagetto) [18:15:49] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10RobH) [18:16:56] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10RobH) [18:19:17] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:19:23] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:28] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labsdb1006.eqiad.wmnet` -... 
[18:19:29] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[18:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[18:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:39] Operations, Data-Services, decommission, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labsdb1007.eqiad.wmnet` -...
[18:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:18] Operations, Data-Services, decommission, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (RobH)
[18:21:51] (PS1) RobH: labsdb100[67] production dns decom [dns] - https://gerrit.wikimedia.org/r/525619 (https://phabricator.wikimedia.org/T220144)
[18:22:35] (CR) RobH: [C: +2] labsdb100[67] production dns decom [dns] - https://gerrit.wikimedia.org/r/525619 (https://phabricator.wikimedia.org/T220144) (owner: RobH)
[18:22:37] (PS1) RobH: decom labsdb100[67] [puppet] - https://gerrit.wikimedia.org/r/525620 (https://phabricator.wikimedia.org/T220144)
[18:22:57] (CR) RobH: [C: +2] decom labsdb100[67] [puppet] - https://gerrit.wikimedia.org/r/525620 (https://phabricator.wikimedia.org/T220144) (owner: RobH)
[18:26:49] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1016 with 10G interfaces - https://phabricator.wikimedia.org/T228692 (Cmjohnson)
[18:30:09] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1016 with 10G interfaces - https://phabricator.wikimedia.org/T228692 (Cmjohnson) @andrewbogott all the physical work is done and switch ports/vlans have been updated.
[18:30:36] Operations, Data-Services, decommission, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (RobH) a:RobH→None
[18:31:02] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Bstorm) Discussed with @RobH IRC. This is doable as long as it can wait behind some 10G decommissions, which seems fine to me. Updating the...
[18:33:08] Operations, MediaWiki-extensions-CentralAuth, TimedMediaHandler, Traffic, and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (TheDJ) Is this fixed now ?
[18:34:50] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Cmjohnson) Open→Resolved @Andrew Resolving this task (again) if the same issue returns please reopen. If it's a di...
[18:35:11] (PS1) Ottomata: Allow swift/upload/complete events in streams named .*swift.upload-complete [deployment-charts] - https://gerrit.wikimedia.org/r/525621
[18:35:25] (CR) Krinkle: "Ive documented this for now on-wiki at https://www.mediawiki.org/wiki/Continuous_integration/Docker#Could_not_find_a_suitable_TLS_CA_certi" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/500417 (owner: Giuseppe Lavagetto)
[18:36:19] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[18:36:25] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[18:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:29] (PS2) Ottomata: Allow swift/upload/complete events in streams named .*swift.upload-complete [deployment-charts] - https://gerrit.wikimedia.org/r/525621
[18:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:33] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db1068.eqiad.wmnet` - db1068.eqiad.wmnet - Removed from Puppet master and PuppetDB - Downt...
[18:36:51] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (RobH)
[18:37:07] (CR) Ottomata: "Thoughts about stream naming?" [deployment-charts] - https://gerrit.wikimedia.org/r/525621 (owner: Ottomata)
[18:37:23] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Bstorm) a:ayounsi→RobH
[18:37:33] (PS3) Ottomata: Allow swift/upload/complete events in streams named .*swift.upload-complete [deployment-charts] - https://gerrit.wikimedia.org/r/525621 (https://phabricator.wikimedia.org/T227896)
[18:38:25] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (Bstorm) a:Bstorm→RobH The racking proposal is detailed in T224188, so re-assigning
[18:39:18] Operations, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (Cmjohnson) @andrew MAC address for 10G NIC F4:E9:D4:BA:B7:10 dhcp file is not updated. I am removing the DC Ops and ops-eqiad tag. Please resolve this task...
[18:40:07] Operations, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1016 with 10G interfaces - https://phabricator.wikimedia.org/T228692 (Cmjohnson) @andrew MAC address for 10G NIC F4:E9:D4:BA:B7:40 dhcp file is not updated. I am removing the DC Ops and ops-eqiad tag. Please resolve this task...
[18:40:21] (PS1) RobH: decom db1068 [dns] - https://gerrit.wikimedia.org/r/525622 (https://phabricator.wikimedia.org/T226689)
[18:40:49] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Cmjohnson)
[18:40:53] (CR) RobH: [C: +2] decom db1068 [dns] - https://gerrit.wikimedia.org/r/525622 (https://phabricator.wikimedia.org/T226689) (owner: RobH)
[18:41:00] (PS1) RobH: decom db1068 [puppet] - https://gerrit.wikimedia.org/r/525624 (https://phabricator.wikimedia.org/T226689)
[18:41:28] (CR) RobH: [C: +2] decom db1068 [puppet] - https://gerrit.wikimedia.org/r/525624 (https://phabricator.wikimedia.org/T226689) (owner: RobH)
[18:44:53] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (RobH)
[18:45:48] (CR) Bstorm: [C: +1] dumps dist: switch active web to labstore1007 [puppet] - https://gerrit.wikimedia.org/r/525541 (https://phabricator.wikimedia.org/T224228) (owner: Jhedden)
[18:45:54] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (RobH) a:RobH→None
[18:46:22] (PS4) Ottomata: Allow swift/upload/complete events in streams named swift.*.upload-complete [deployment-charts] - https://gerrit.wikimedia.org/r/525621 (https://phabricator.wikimedia.org/T227896)
[18:46:50] (PS2) Ottomata: Allow eventgate-analytics to get schemas from remote schema.svc if not present locally [deployment-charts] - https://gerrit.wikimedia.org/r/525609 (https://phabricator.wikimedia.org/T206789)
[18:47:22] (PS2) Jhedden: dumps dist: switch active web to labstore1007 [puppet] - https://gerrit.wikimedia.org/r/525541 (https://phabricator.wikimedia.org/T224228)
[18:48:16] (PS3) Ottomata: Allow eventgate-analytics to get schemas from remote schema.svc if not present locally [deployment-charts] - https://gerrit.wikimedia.org/r/525609 (https://phabricator.wikimedia.org/T206789)
[18:48:55] (CR) Jhedden: [C: +2] dumps dist: switch active web to labstore1007 [puppet] - https://gerrit.wikimedia.org/r/525541 (https://phabricator.wikimedia.org/T224228) (owner: Jhedden)
[18:49:37] Operations, SRE-Access-Requests, VisualEditor (Current work): Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (herron) Open→Resolved a:herron Hi David, the requested access has been provisioned. I'll transition this to resolved now...
[18:49:48] (CR) Ottomata: [V: +2 C: +2] Allow eventgate-analytics to get schemas from remote schema.svc if not present locally [deployment-charts] - https://gerrit.wikimedia.org/r/525609 (https://phabricator.wikimedia.org/T206789) (owner: Ottomata)
[18:50:06] (PS5) Ottomata: Allow swift/upload/complete events in streams named swift.*.upload-complete [deployment-charts] - https://gerrit.wikimedia.org/r/525621 (https://phabricator.wikimedia.org/T227896)
[18:51:04] (CR) Ppchelko: [V: +2 C: +2] [RESTRouter] Switch event service to eventgate. [deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T524055) (owner: Ppchelko)
[18:51:07] (CR) Ottomata: [V: +2 C: +2] Allow swift/upload/complete events in streams named swift.*.upload-complete [deployment-charts] - https://gerrit.wikimedia.org/r/525621 (https://phabricator.wikimedia.org/T227896) (owner: Ottomata)
[18:51:47] oh Pchelolo cool restrouter (?) is in k8s?
[18:51:58] ottomata: it's not really deployed yet
[18:52:01] ah
[18:52:08] (CR) Ottomata: [C: +1] [RESTRouter] Switch event service to eventgate. [deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T524055) (owner: Ppchelko)
[18:52:24] we got images building, we got a chart, but we didn't actually deploy it just yet
[18:56:30] (PS1) Ottomata: eventgate-analytics stream config swift upload regex - escape . [deployment-charts] - https://gerrit.wikimedia.org/r/525625 (https://phabricator.wikimedia.org/T227896)
[18:57:15] (CR) Ottomata: [V: +2 C: +2] eventgate-analytics stream config swift upload regex - escape . [deployment-charts] - https://gerrit.wikimedia.org/r/525625 (https://phabricator.wikimedia.org/T227896) (owner: Ottomata)
[18:59:57] (PS2) Ayounsi: Bird anycast, add monitoring for anycast-healthchecker [puppet] - https://gerrit.wikimedia.org/r/520643 (https://phabricator.wikimedia.org/T186550)
[19:01:03] !log otto@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' .
[19:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:39] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[19:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[19:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:08] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[19:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:37] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[19:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:33] (PS1) RobH: decom restbase10(0[7-9]|1[0-5]) prod dns [dns] - https://gerrit.wikimedia.org/r/525627 (https://phabricator.wikimedia.org/T226715)
[19:09:11] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-
[19:09:11] the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:10:17] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:10:27] (PS1) RobH: decom restbase10(0[7-9]|1[0-5]) [puppet] - https://gerrit.wikimedia.org/r/525628 (https://phabricator.wikimedia.org/T226715)
[19:10:40] (CR) RobH: [C: +2] decom restbase10(0[7-9]|1[0-5]) prod dns [dns] - https://gerrit.wikimedia.org/r/525627 (https://phabricator.wikimedia.org/T226715) (owner: RobH)
[19:10:47] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:11:05] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:11:16] (CR) RobH: [C: +2] decom restbase10(0[7-9]|1[0-5]) [puppet] - https://gerrit.wikimedia.org/r/525628 (https://phabricator.wikimedia.org/T226715) (owner: RobH)
[19:12:07] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:13:29] PROBLEM - puppet last run on cloudvirt1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:13:43] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:15:03] Operations, ops-eqiad, DC-Ops, decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (RobH)
[19:15:25] Operations, ops-eqiad, DC-Ops, decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (RobH) a:RobH→None
[19:16:48] Operations, DC-Ops, Traffic, decommission: Decommission lvs100[123456] - https://phabricator.wikimedia.org/T228671 (RobH) Open→Declined This seems to be a duplicate of T224223
[19:17:16] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:18:16] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:24:50] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:27:56] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:29:00] !log ppchelko@deploy1001 Started deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016
[19:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:07] T229016: Use CSP headers from backend even when stored payload is served - https://phabricator.wikimedia.org/T229016
[19:39:44] RECOVERY - puppet last run on cloudvirt1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:40:34] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:41:13] Operations, ops-eqiad, decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (RobH)
[19:42:04] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:42:42] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016 (duration: 13m 42s)
[19:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:49] T229016: Use CSP headers from backend even when stored payload is served - https://phabricator.wikimedia.org/T229016
[19:42:53] !log ppchelko@deploy1001 Started deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016, take 2
[19:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:04] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[19:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[19:44:12] Operations, ops-eqiad, decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `rhenium.wikimedia.org` - rhenium.wikimedia.org - Removed from Puppet master and PuppetDB - Downtimed...
[19:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:17] Operations, ops-eqiad, decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (RobH)
[19:46:16] (PS1) RobH: decom rhenium [puppet] - https://gerrit.wikimedia.org/r/525635 (https://phabricator.wikimedia.org/T224268)
[19:46:46] (PS1) RobH: decom rhenium prod dns [dns] - https://gerrit.wikimedia.org/r/525636 (https://phabricator.wikimedia.org/T224268)
[19:46:57] (CR) RobH: [C: +2] decom rhenium [puppet] - https://gerrit.wikimedia.org/r/525635 (https://phabricator.wikimedia.org/T224268) (owner: RobH)
[19:47:28] (CR) RobH: [C: +2] decom rhenium prod dns [dns] - https://gerrit.wikimedia.org/r/525636 (https://phabricator.wikimedia.org/T224268) (owner: RobH)
[19:49:07] Operations, ops-eqiad, decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (RobH) a:RobH→None
[19:49:26] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016, take 2 (duration: 06m 33s)
[19:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:34] T229016: Use CSP headers from backend even when stored payload is served - https://phabricator.wikimedia.org/T229016
[19:49:37] !log ppchelko@deploy1001 Started deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016, take 3
[19:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:55] Operations, decommission: Decommission old server wmf4077 - https://phabricator.wikimedia.org/T190086 (RobH) This doesnt exist in netbox any longer, so it must have been removed and not closed out before we moved to netbox.
[19:50:03] Operations, decommission: Decommission old server wmf4077 - https://phabricator.wikimedia.org/T190086 (RobH) Open→Invalid
[19:50:28] Operations, ops-eqiad, decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (RobH)
[19:52:18] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ayounsi) Note that there are 38 servers using SFP-Ts, which mean using 1G on a 10G switch. ` asw2-b-eqiad> show chassis hardware | match SFP-...
[19:52:50] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016, take 3 (duration: 03m 14s)
[19:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:07] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[19:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[19:53:16] Operations, ops-eqiad, decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db1064.eqiad.wmnet` - db1064.eqiad.wmnet - Removed from Puppet master and PuppetDB - Downtimed host on...
[19:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:19] Operations, ops-eqiad, decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (RobH)
[19:53:53] !log ppchelko@deploy1001 Started deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016, feeds timing out.
[19:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:24] (PS1) RobH: decom db1064 prod dns [dns] - https://gerrit.wikimedia.org/r/525637 (https://phabricator.wikimedia.org/T223217)
[19:54:47] (CR) jerkins-bot: [V: -1] decom db1064 prod dns [dns] - https://gerrit.wikimedia.org/r/525637 (https://phabricator.wikimedia.org/T223217) (owner: RobH)
[19:55:12] (PS1) RobH: db1064 decom [puppet] - https://gerrit.wikimedia.org/r/525638 (https://phabricator.wikimedia.org/T223217)
[19:56:48] (PS2) RobH: decom db1064 prod dns [dns] - https://gerrit.wikimedia.org/r/525637 (https://phabricator.wikimedia.org/T223217)
[19:56:55] (CR) RobH: [C: +2] db1064 decom [puppet] - https://gerrit.wikimedia.org/r/525638 (https://phabricator.wikimedia.org/T223217) (owner: RobH)
[19:57:36] (CR) RobH: [C: +2] decom db1064 prod dns [dns] - https://gerrit.wikimedia.org/r/525637 (https://phabricator.wikimedia.org/T223217) (owner: RobH)
[19:58:33] Operations, ops-eqiad, decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (RobH) a:RobH→None
[19:59:27] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@279cf27]: Set proper CSP headers for mobile-html T229016, feeds timing out. (duration: 05m 34s)
[19:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:37] T229016: Use CSP headers from backend even when stored payload is served - https://phabricator.wikimedia.org/T229016
[20:02:55] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[20:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:02] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[20:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:14] Operations, ops-eqiad, DC-Ops, decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db1069.eqiad.wmnet` - db1069.eqiad.wmnet - Removed from Puppet master and PuppetDB - Downt...
[20:03:21] Operations, ops-eqiad, DC-Ops, decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (RobH)
[20:04:14] (PS1) RobH: db1069 prod dns decom [dns] - https://gerrit.wikimedia.org/r/525639 (https://phabricator.wikimedia.org/T227166)
[20:04:59] (CR) RobH: [C: +2] db1069 prod dns decom [dns] - https://gerrit.wikimedia.org/r/525639 (https://phabricator.wikimedia.org/T227166) (owner: RobH)
[20:05:02] (PS1) RobH: db1069 decom [puppet] - https://gerrit.wikimedia.org/r/525641 (https://phabricator.wikimedia.org/T227166)
[20:05:27] (CR) RobH: [C: +2] db1069 decom [puppet] - https://gerrit.wikimedia.org/r/525641 (https://phabricator.wikimedia.org/T227166) (owner: RobH)
[20:05:56] (CR) Jeena Huneidi: [V: +2 C: +2] Package mediawiki-dev and add to index [deployment-charts] - https://gerrit.wikimedia.org/r/525173 (owner: Jeena Huneidi)
[20:07:13] (PS4) Jeena Huneidi: Package mediawiki-dev and add to index [deployment-charts] - https://gerrit.wikimedia.org/r/525173
[20:07:23] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1069 - https://phabricator.wikimedia.org/T227166 (RobH) a:RobH→None
[20:07:30] Operations, ops-eqiad, DC-Ops, decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (RobH)
[20:08:01] (CR) Jeena Huneidi: [V: +2 C: +2] Package mediawiki-dev and add to index [deployment-charts] - https://gerrit.wikimedia.org/r/525173 (owner: Jeena Huneidi)
[20:09:30] Operations, ops-eqiad, DC-Ops, Traffic, decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (RobH)
[20:15:27] !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[20:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[20:15:59] Operations, ops-eqiad, DC-Ops, Traffic, decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `lvs[1001-1006].wikimedia.org` - lvs1004.wikimedia.org - R...
[20:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:39] (PS1) RobH: decom lvs100[1-6] production dns [dns] - https://gerrit.wikimedia.org/r/525644 (https://phabricator.wikimedia.org/T224223)
[20:22:35] (CR) RobH: [C: +2] decom lvs100[1-6] production dns [dns] - https://gerrit.wikimedia.org/r/525644 (https://phabricator.wikimedia.org/T224223) (owner: RobH)
[20:29:33] Operations, ops-eqiad, DC-Ops, Traffic, decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (RobH)
[20:34:32] Operations, procurement: Request to Order Drive Replacement on elastic1046 - https://phabricator.wikimedia.org/T229017 (Peachey88)
[20:37:20] !log add prometheus-bird-exporter to stretch-wikimedia repo
[20:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:08] !log Rebasing mediawiki/extensions/MobileFrontend@wmf/1.34.0-wmf.15 for a build/CI related change to package.json https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MobileFrontend/+/525632/
[20:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:46] (PS4) BBlack: anycast recdns: use for esams LVS balancers [puppet] - https://gerrit.wikimedia.org/r/525550 (https://phabricator.wikimedia.org/T228190)
[20:58:48] (PS4) BBlack: anycast recdns: use for ulsfo LVS balancers [puppet] - https://gerrit.wikimedia.org/r/525551 (https://phabricator.wikimedia.org/T228190)
[20:58:50] (PS4) BBlack: anycast recdns: use for codfw LVS balancers [puppet] - https://gerrit.wikimedia.org/r/525552 (https://phabricator.wikimedia.org/T228190)
[20:58:52] (PS6) BBlack: anycast recdns: use for all LVS balancers [puppet] - https://gerrit.wikimedia.org/r/520441 (https://phabricator.wikimedia.org/T228190)
[20:59:22] (CR) BBlack: [C: +2] anycast recdns: use for esams LVS balancers [puppet] - https://gerrit.wikimedia.org/r/525550 (https://phabricator.wikimedia.org/T228190) (owner: BBlack)
[20:59:29] (CR) BBlack: [C: +2] anycast recdns: use for ulsfo LVS balancers [puppet] - https://gerrit.wikimedia.org/r/525551 (https://phabricator.wikimedia.org/T228190) (owner: BBlack)
[20:59:53] (CR) BBlack: [C: +2] anycast recdns: use for codfw LVS balancers [puppet] - https://gerrit.wikimedia.org/r/525552 (https://phabricator.wikimedia.org/T228190) (owner: BBlack)
[21:07:39] !log backup lvses in codfw, esams, ulsfo: restart pybal for resolv.conf changes - T228190
[21:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:46] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190
[21:07:54] (CR) Eevans: table-properties: Initial commit (6 comments) [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: Holger Knust)
[21:13:41] PROBLEM - puppet last run on cloudvirt1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:21:51] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[21:22:53] (CR) Eevans: table-properties: Initial commit (2 comments) [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: Holger Knust)
[21:24:13] ACKNOWLEDGEMENT - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.092 second response time Jhedden investigating https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[21:24:55] RECOVERY - puppet last run on cloudvirt1008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:27:31] (CR) Kosta Harlan: "Roan, do you know why this isn't closed? I still see it on my outgoing reviews list in gerrit." [mediawiki-config] - https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: Kosta Harlan)
[21:29:11] (PS2) Jforrester: Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: Kosta Harlan)
[21:29:18] (CR) Jforrester: [C: +2] Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: Kosta Harlan)
[21:30:09] (PS7) Jforrester: Even more invariant config moved over to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/512418
[21:30:16] (Merged) jenkins-bot: Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: Kosta Harlan)
[21:30:25] jouncebot: next
[21:30:25] In 1 hour(s) and 29 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T2300)
[21:30:45] (CR) jenkins-bot: Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: Kosta Harlan)
[21:31:49] (PS5) Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507726
[21:32:15] (PS6) Jforrester: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507726
[21:33:05] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[21:34:23] (PS1) Ayounsi: Anycast: Add Prometheus exporter to Bird [puppet] - https://gerrit.wikimedia.org/r/525659
[21:34:33] (CR) Jforrester: [C: +2] "Let's give this a go." [mediawiki-config] - https://gerrit.wikimedia.org/r/507726 (owner: Jforrester)
[21:35:06] I'm going to be using mwdebug1002 for a bit.
[21:35:07] (PS2) Ayounsi: Anycast: Add Prometheus exporter to Bird [puppet] - https://gerrit.wikimedia.org/r/525659
[21:35:27] (Merged) jenkins-bot: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507726 (owner: Jforrester)
[21:36:28] (CR) jenkins-bot: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507726 (owner: Jforrester)
[21:38:20] !log primary high-traffic1 lvses in codfw, esams, ulsfo: restart pybal for resolv.conf changes - T228190
[21:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:27] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190
[21:39:05] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1877 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:39:25] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2405 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:39:27] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1877 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:39:45] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1869 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:41:14] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/17625/" [puppet] - https://gerrit.wikimedia.org/r/525659 (owner: Ayounsi)
[21:41:39] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[21:41:43] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[21:45:49] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 494 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:46:05] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 492 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:46:22] !log apply export BGP_Wikimedia_no_dfz to eqiad's Confed_esams - T227808
[21:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:28] T227808: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808
[21:47:05] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77233 bytes in 0.238 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:47:05] !log primary high-traffic2 lvses in codfw, esams, ulsfo: restart pybal for resolv.conf changes - T228190
[21:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:12] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190
[21:47:23] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77280
bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:48:29] Well, that didn't work. [21:49:03] (03PS1) 10Jforrester: Revert "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525664 [21:49:09] (03CR) 10Jforrester: [C: 03+2] Revert "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525664 (owner: 10Jforrester) [21:50:10] (03Merged) 10jenkins-bot: Revert "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525664 (owner: 10Jforrester) [21:50:25] (03CR) 10jenkins-bot: Revert "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525664 (owner: 10Jforrester) [21:56:34] (Production is clear again.) 
[21:57:12] (03CR) 10BBlack: [C: 03+2] anycast recdns: use for all LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/520441 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [21:59:06] !log lvs1016 - restart pybal for resolv.conf changes - T228190 [21:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:13] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 [22:02:45] !log lvs1015 - restart pybal for resolv.conf changes - T228190 [22:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:11] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:04:41] !log lvs1014 - restart pybal for resolv.conf changes - T228190 [22:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:50] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 [22:07:03] !log lvs1013 - restart pybal for resolv.conf changes - T228190 [22:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:19] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:08:04] 10Operations, 10netops: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 (10ayounsi) Confirmed working as expected, eg. esams still show the customer prefixes, plus now BGP advertised prefixes (LVS/Anycast). Will let it sit before rolling out to all sites. 
[22:13:18] 10Operations, 10Traffic: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 (10BBlack) All the LVSes are now using the anycasted recdns, which gets rid of the LVS<->recdns dependency loop and simplifies recdns server downtime processes: https://wikitech.wikimedia.org/w/index.ph... [22:16:33] 10Operations, 10MediaWiki-Uploading, 10Multimedia, 10media-storage, 10Wikimedia-production-error: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (10Aklapper) {T228929} and {T229056} might be outcomes of this... [22:18:01] PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:45:25] (03PS1) 10Thcipriani: blubberoid: update base chart for "helm test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525678 [22:46:01] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:46:45] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] blubberoid: update base chart for "helm test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525678 (owner: 10Thcipriani) [22:54:09] PROBLEM - puppet last run on ms-be1046 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:54:43] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190725T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:17] * Urbanecm is going to deploy a few things [23:00:20] Go for it [23:00:40] Urbanecm: Could you do me a favor and review+merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/525668 ? [23:00:51] Certainly :-) [23:01:08] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10CDanis) [23:01:17] Thanks, then I'll deploy it after you're done [23:01:26] wait [23:01:27] I can't [23:01:32] I don't have +2 on mediawiki/* [23:01:47] 10Operations, 10Goal: TEC6: Database Automation - https://phabricator.wikimedia.org/T220395 (10CDanis) [23:01:54] RoanKattouw, ^^ [23:02:00] (03CR) 10Urbanecm: [C: 03+2] Enable VisualEditor in namespace Wikipédia on Slovak Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525580 (https://phabricator.wikimedia.org/T229014) (owner: 10Zoranzoki21) [23:02:08] Argh [23:02:13] OK I'll ask someone else [23:03:15] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 3 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10Krinkle) [23:03:21] 10Operations, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10greg) What are the next steps with this incident task? The... 
[23:03:56] +1'ed, in case it matters [23:05:24] (03Merged) 10jenkins-bot: Enable VisualEditor in namespace Wikipédia on Slovak Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525580 (https://phabricator.wikimedia.org/T229014) (owner: 10Zoranzoki21) [23:06:00] someone seems to be playing with deploy1001... git status says "you are currently rebasing" [23:06:08] (in /srv/mediawiki-staging) [23:06:33] (03CR) 10Krinkle: "What about revision delete? Does the MW extension add the urls to purge actions and/or rev del hooks? (I don't know if those hooks exist)." [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) (owner: 10Ladsgroup) [23:06:37] (03CR) 10jenkins-bot: Enable VisualEditor in namespace Wikipédia on Slovak Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525580 (https://phabricator.wikimedia.org/T229014) (owner: 10Zoranzoki21) [23:06:45] _seems_ to be clean, but RoanKattouw (or anyone awake/in US), could you check, just in case? [23:06:53] Looking [23:07:13] thanks [23:07:42] OK I think it's clean now [23:07:49] thanks [23:07:53] I also ran git pull --rebase which pulled in the Slovak Wikipedia patch [23:08:16] (03PS1) 10CDanis: Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) [23:08:43] thanks [23:09:16] (03CR) 10CDanis: [C: 04-2] "*Not* to be merged yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [23:11:02] (03CR) 10Krinkle: "I guess these factors are fine given this doesn't set any caching, it just doesn't unset it. 
this means the MW side is now fully responsib" [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) (owner: 10Ladsgroup) [23:11:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:525580|Enable VisualEditor in namespace Wikipédia on Slovak Wikipedia]] (T229014) (duration: 00m 48s) [23:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:12] T229014: Enable VisualEditor in namespace Wikipédia on Slovak Wikipedia - https://phabricator.wikimedia.org/T229014 [23:11:13] (03PS4) 10Urbanecm: Revert "Delete Image-reviewer group from commonswiki for good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523214 (https://phabricator.wikimedia.org/T228098) [23:11:35] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523214 (https://phabricator.wikimedia.org/T228098) (owner: 10Urbanecm) [23:13:59] (03PS2) 10Urbanecm: Add sju, sjd, and rmf to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520364 (https://phabricator.wikimedia.org/T226701) (owner: 10Tulsi Bhagat) [23:14:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520364 (https://phabricator.wikimedia.org/T226701) (owner: 10Tulsi Bhagat) [23:15:26] (03Merged) 10jenkins-bot: Add sju, sjd, and rmf to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520364 (https://phabricator.wikimedia.org/T226701) (owner: 10Tulsi Bhagat) [23:16:24] (03PS1) 10QChris: Add .gitreview [software/homer] - 10https://gerrit.wikimedia.org/r/525687 [23:16:26] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [software/homer] - 10https://gerrit.wikimedia.org/r/525687 (owner: 10QChris) [23:16:30] (03CR) 10jenkins-bot: Add sju, sjd, and rmf to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520364 (https://phabricator.wikimedia.org/T226701) (owner: 10Tulsi Bhagat) 
[23:18:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:520364|Add sju, sjd, and rmf to wmgExtraLanguageNames]] (T226701) (duration: 00m 47s) [23:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:12] T226701: Add sju, sjd, and rmf to wmgExtraLanguageNames - https://phabricator.wikimedia.org/T226701 [23:18:34] (03CR) 10jenkins-bot: Revert "Delete Image-reviewer group from commonswiki for good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523214 (https://phabricator.wikimedia.org/T228098) (owner: 10Urbanecm) [23:19:14] RoanKattouw, one last sync and the window will be ready for you [23:19:18] Thanks! [23:19:47] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[:gerrit:523214|Revert "Delete Image-reviewer group from commonswiki for good"]] (T228098) (duration: 00m 47s) [23:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:54] T228098: Remove `unset( $wgGroupPermissions['Image-reviewer'] );` for commonswiki - https://phabricator.wikimedia.org/T228098 [23:20:20] RoanKattouw, done! 
[23:22:11] RECOVERY - puppet last run on ms-be1046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:22:43] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:27:12] (03PS1) 10QChris: Add .gitreview [homer/public] - 10https://gerrit.wikimedia.org/r/525689 [23:27:14] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [homer/public] - 10https://gerrit.wikimedia.org/r/525689 (owner: 10QChris) [23:31:28] Ugh my patch failed Jenkins, retrying it now [23:32:41] (03PS1) 10QChris: Add .gitreview [homer/mock-private] - 10https://gerrit.wikimedia.org/r/525690 [23:32:43] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [homer/mock-private] - 10https://gerrit.wikimedia.org/r/525690 (owner: 10QChris) [23:47:49] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/GrowthExperiments/extension.json: Fix over-eager GrowthExperiments popups (T229045) (duration: 00m 50s) [23:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:57] T229045: Homepage: users receiving discovery tour even if old account - https://phabricator.wikimedia.org/T229045