[00:02:51] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:10:05] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational
[00:12:45] Operations, Librarization, MediaWiki-extensions-CentralNotice, Traffic, and 3 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848 (Krinkle)
[00:36:07] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 150.82, 63.25, 32.09
[00:40:29] RECOVERY - High CPU load on API appserver on mw1347 is OK: OK - load average: 15.50, 33.84, 27.25
[02:31:14] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell) @arielGlenn I built the whole package successfully and uploaded the sshd bina...
[02:31:30] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell) a:mmodell
[03:16:24] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Bstorm) 100% agree with you @faidon, and I appreciate the reply. I'm aiming to avoid any sugar-coating in my assessments of risks until I h...
[03:37:49] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Bstorm) Note: there are rate limits that can be set within openstack for this as well...but in some versions, they don't work right at all (t...
[03:46:55] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 28431 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[04:02:57] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[04:07:45] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (KartikMistry) >>! In T210704#5270565, @KartikMistry wrote: > I was able to reproduce error we saw in Production using end p...
[04:12:17] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops
[04:54:48] !log Rename table wikimedia_editor_tasks_entity_description_exists in db1092 T226326
[04:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:54] T226326: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326
[04:58:17] (PS1) Marostegui: filtered_tables: Remove references to a table [puppet] - https://gerrit.wikimedia.org/r/518455 (https://phabricator.wikimedia.org/T226326)
[04:59:20] !log Rename table wikimedia_editor_tasks_entity_description_exists in db1123 (testwikidatawiki) T226326
[04:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:55] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1135 [mediawiki-config] - https://gerrit.wikimedia.org/r/518456 (https://phabricator.wikimedia.org/T222682)
[05:03:55] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db1135 [mediawiki-config] - https://gerrit.wikimedia.org/r/518456 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:04:48] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1135 [mediawiki-config] -
https://gerrit.wikimedia.org/r/518456 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:06:34] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1135 [mediawiki-config] - https://gerrit.wikimedia.org/r/518456 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:06:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1135 from config T222682 (duration: 01m 07s)
[05:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:00] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[05:07:56] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1135 from config T222682 (duration: 00m 55s)
[05:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:08] (CR) Marostegui: [C: +2] filtered_tables: Remove references to a table [puppet] - https://gerrit.wikimedia.org/r/518455 (https://phabricator.wikimedia.org/T226326) (owner: Marostegui)
[05:11:53] (PS5) Marostegui: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725)
[05:13:50] (PS1) Marostegui: db1135: Allow reimage [puppet] - https://gerrit.wikimedia.org/r/518457 (https://phabricator.wikimedia.org/T222682)
[05:14:58] (CR) Marostegui: [C: +2] db1135: Allow reimage [puppet] - https://gerrit.wikimedia.org/r/518457 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:27:36] (PS1) Marostegui: mariadb: Move db1135 from s4 to m1 [puppet] - https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682)
[05:28:24] (CR) jerkins-bot: [V: -1] mariadb: Move db1135 from s4 to m1 [puppet] - https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:29:25] (PS2) Marostegui: mariadb: Move db1135 from s4 to m1 [puppet] -
https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682)
[05:30:09] (CR) jerkins-bot: [V: -1] mariadb: Move db1135 from s4 to m1 [puppet] - https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:33:33] (CR) Marostegui: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/17061/" [puppet] - https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:34:45] (PS3) Marostegui: mariadb: Move db1135 from s4 to m1 [puppet] - https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682)
[05:35:47] (CR) jerkins-bot: [V: -1] mariadb: Move db1135 from s4 to m1 [puppet] - https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:36:42] (CR) Marostegui: [V: +2 C: +2] mariadb: Move db1135 from s4 to m1 [puppet] - https://gerrit.wikimedia.org/r/518458 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:56:00] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (ArielGlenn) >>! In T224677#5277070, @mmodell wrote: > @arielGlenn I built the whole pa...
[05:57:10] <_joe_> !log rebuilding base debian/alpine images to pick up security updates
[05:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:17] Operations, Dumps-Generation, procurement: consider getting a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (ArielGlenn)
[06:01:53] !log Stop MySQL on db1117:3321 to clone db1135 (haproxy alert will be triggered) - T222682
[06:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:58] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[06:02:27] PROBLEM - HHVM rendering on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[06:02:51] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe)
[06:03:29] RECOVERY - HHVM rendering on mw2261 is OK: HTTP OK: HTTP/1.1 200 OK - 79284 bytes in 0.330 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:04:07] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe) p:Triage→High
[06:04:43] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe) a:Joe While I'm at it, I'll upgrade all production images with new base images too.
[06:04:57] ACKNOWLEDGEMENT - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[06:09:50] ACKNOWLEDGEMENT - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[06:12:11] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ayounsi) > our network interface saturation monitoring is still diamond-based What does that mean? > Is there a way my team could monitor an...
[06:16:06] !log powercycle analytics1060 (stuck, no ssh, no console com2 available)
[06:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:06] RECOVERY - Host analytics1060 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[06:22:24] (PS1) Giuseppe Lavagetto: Refresh nodejs10-slim image [docker-images/production-images] - https://gerrit.wikimedia.org/r/518468 (https://phabricator.wikimedia.org/T226346)
[06:23:10] (PS1) Elukey: role::analytics_test_cluste::hadoop::master|stanby: allow https port [puppet] - https://gerrit.wikimedia.org/r/518469 (https://phabricator.wikimedia.org/T212259)
[06:23:45] (CR) Elukey: [C: +2] role::analytics_test_cluste::hadoop::master|stanby: allow https port [puppet] - https://gerrit.wikimedia.org/r/518469 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[06:28:17] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Refresh nodejs10-slim image [docker-images/production-images] - https://gerrit.wikimedia.org/r/518468 (https://phabricator.wikimedia.org/T226346) (owner: Giuseppe Lavagetto)
[06:30:14] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:02] PROBLEM - puppet last run on theemin is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:03] <_joe_> !log publishing docker-registry.wikimedia.org/nodejs10-slim:0.0.2, T226346
[06:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:08] T226346: Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346
[06:31:37] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Joe)
[06:31:42] Operations, serviceops, Core Platform Team Backlog (Later), Patch-For-Review, Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe) Open→Resolved
[06:31:50] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:32:50] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Joe) @KartikMistry if we trigger a rebuild of the production container, it should now use the newer nodejs10-slim image and...
[06:34:46] PROBLEM - Docker registry HTTPS interface on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string schemaVersion not found on https://registry1002.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 382 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:34:48] PROBLEM - Docker registry HTTPS interface on registry1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string schemaVersion not found on https://registry1001.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 382 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:57:20] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:58:08] RECOVERY - puppet last run on theemin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:52] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:05:52] (PS1) Marostegui: install_server: Do not reimage db1135 [puppet] - https://gerrit.wikimedia.org/r/518645 (https://phabricator.wikimedia.org/T222682)
[07:06:03] !log installing vim update for stretch
[07:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:18] (CR) Marostegui: [C: +2] install_server: Do not reimage db1135 [puppet] - https://gerrit.wikimedia.org/r/518645 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[07:10:43] (PS1) Elukey: hadoop: set 'hdfs' as admin user for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/518646 (https://phabricator.wikimedia.org/T212259)
[07:13:20] (CR) Elukey: [C: +2] hadoop: set 'hdfs' as admin user for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/518646 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[07:18:38] (CR) Hashar: [C: +1] "PHP used to be enabled a while
ago but PHP has been disabled for at least 7 months when upgrading to Stretch. It might even have been disa" [puppet] - https://gerrit.wikimedia.org/r/517926 (owner: Krinkle)
[07:23:18] (CR) Giuseppe Lavagetto: [C: -1] "This needs one minor correction, I'll take care of it." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/517926 (owner: Krinkle)
[07:24:46] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (MoritzMuehlenhoff) >>! In T224188#5277181, @ayounsi wrote: >> our network interface saturation monitoring is still diamond-based > What does...
[07:25:48] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim]
[07:26:26] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim]
[07:30:27] moritzm: ^ I guess that will get fixed on a second run?
[07:33:24] PROBLEM - puppet last run on dns1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim]
[07:36:38] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim]
[07:37:21] yeah, that'll recover soonish, dpkg database is blocked during rollout of packages like the vim update and that triggers some Puppet/Icinga spam
[07:37:58] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures.
Failed resources (up to 3 shown): Package[vim],Package[python-ldap]
[07:41:27] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Increased failure rate of varnish be fetches - https://phabricator.wikimedia.org/T226318 (TheDJ) So the peaks on the graph stop right around the time in the -operations channel there was the log: 21:31 <+icinga-wm> PROBLEM...
[07:41:31] (PS1) Elukey: hadoop: format dfs.cluster.administrators correctly [puppet] - https://gerrit.wikimedia.org/r/518648 (https://phabricator.wikimedia.org/T212259)
[07:42:28] (CR) Elukey: [C: +2] hadoop: format dfs.cluster.administrators correctly [puppet] - https://gerrit.wikimedia.org/r/518648 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[07:48:24] (PS1) Muehlenhoff: Extend account date for Mathew with new contract date [puppet] - https://gerrit.wikimedia.org/r/518649
[07:49:26] (PS2) Muehlenhoff: Extend account date for Mathew with new contract date [puppet] - https://gerrit.wikimedia.org/r/518649
[07:51:58] !log stop mysql consumer on eventlog1002 (so traffic to db1107 will be stopped, to allow maintenance to happen)
[07:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:02] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:53:42] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:53:57] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Petar.petkovic)
[07:55:05] (CR) Muehlenhoff: [C: +2] Extend account date for Mathew with new contract date [puppet] - https://gerrit.wikimedia.org/r/518649 (owner: Muehlenhoff)
[07:55:46] Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Upload
docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar)
[07:55:49] Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (hashar)
[07:59:26] Operations, Cognate, DBA, Growth-Team, and 2 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui)
[07:59:36] Operations, Cognate, DBA, Growth-Team, and 2 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) p:Triage→Normal
[08:00:34] RECOVERY - puppet last run on dns1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:03:52] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:05:10] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:05:10] (PS1) Marostegui: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/518651 (https://phabricator.wikimedia.org/T226358)
[08:06:23] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/518651 (https://phabricator.wikimedia.org/T226358) (owner: Marostegui)
[08:07:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/518651 (https://phabricator.wikimedia.org/T226358) (owner: Marostegui)
[08:07:30] (CR) jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/518651 (https://phabricator.wikimedia.org/T226358) (owner: Marostegui)
[08:08:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1120 for upgrade T226358 (duration: 00m 56s)
[08:08:52] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:53] T226358: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358
[08:09:18] !log Stop MySQL on db1120 for upgrade - T226358
[08:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:24] (CR) Hashar: [C: +1] "I have purged the package and user/group from the Docker agents." [puppet] - https://gerrit.wikimedia.org/r/518222 (https://phabricator.wikimedia.org/T226233) (owner: Hashar)
[08:14:42] !log upgrade, stop and restart db1107
[08:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:49] (PS8) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[08:19:18] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:19:18] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:19:22] (PS2) Giuseppe Lavagetto: peopleweb: Remove php module from httpd [puppet] - https://gerrit.wikimedia.org/r/517926 (owner: Krinkle)
[08:19:37] ^ expected from jynus upgrade
[08:19:53] (CR) Hashar: [C: +1] contint: remove several unused packages [puppet] - https://gerrit.wikimedia.org/r/517093 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:19:54] also not in use
[08:20:06] (CR) Hashar: [C: +1] "Packages purged \o/" [puppet] - https://gerrit.wikimedia.org/r/517093 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:20:08] (CR) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits (5 comments) [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[08:20:18] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/518653
[08:21:50] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/518653 (owner: Marostegui)
[08:22:39] (CR) Giuseppe Lavagetto: [C: +2] peopleweb: Remove php module from httpd [puppet] - https://gerrit.wikimedia.org/r/517926 (owner: Krinkle)
[08:22:41] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/518653 (owner: Marostegui)
[08:22:46] (CR) Hashar: [C: +1] "I have removed it from the instances." [puppet] - https://gerrit.wikimedia.org/r/517094 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:24:07] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/518653 (owner: Marostegui)
[08:24:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1120 after upgrade T226358 (duration: 00m 56s)
[08:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:30] T226358: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358
[08:26:23] https://www.irccloud.com/pastebin/624qxEy1/
[08:27:04] ema vgutierrez ^
[08:27:22] persists through several refreshes
[08:28:11] I guess it needs to get a restart
[08:31:57] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Increased failure rate of varnish be fetches - https://phabricator.wikimedia.org/T226318 (Gilles) We have a higher volume of thumbnail rendering than usual due to the deployment of {T216339}, which results in twice the amoun...
[08:32:32] (PS9) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[08:34:25] !log reloading haproxy on dbproxy1004/9
[08:34:26] (PS1) Muehlenhoff: Extend access for jsamra [puppet] - https://gerrit.wikimedia.org/r/518654
[08:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:18] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:35:18] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:35:35] (CR) Muehlenhoff: [C: +2] Extend access for jsamra [puppet] - https://gerrit.wikimedia.org/r/518654 (owner: Muehlenhoff)
[08:37:25] Operations, ops-eqiad, Analytics, Analytics-EventLogging, DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (jcrespo) Resolved→Open {P8645}
[08:38:56] !log upgrade, stop and restart db1108
[08:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:01] (CR) Hashar: [C: +1] "Cleaned up from the CI slaves :]" [puppet] - https://gerrit.wikimedia.org/r/517095 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:41:03] Operations, ops-eqiad, Analytics, Analytics-EventLogging, DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (Marostegui) @Cmjohnson as per the error @jcrespo pasted above is that enough to get Dell to send a new DIMM you think?
[08:42:09] !log reboot an-master100[1,2] for kernel + openjdk upgrades
[08:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:32] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:42:32] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:43:08] again, expected
[08:43:55] (CR) Hashar: [C: +1] "Unmounted and purged from /etc/fstab" [puppet] - https://gerrit.wikimedia.org/r/517098 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:47:05] Operations, MediaWiki-History-and-Diffs, Wikidata, wikidata-tech-focus, Performance: WMFTimeoutException when loading some diffs on Wikidata - https://phabricator.wikimedia.org/T140879 (WMDE-Fisch) Without knowing much about how diffs are generated for Wikidata, I just wanted to mention curre...
[08:47:56] Operations, Wikimedia Australia, Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (StevenCrossin) p:Normal→High Hi team, this has been open for some time now, can it please be actioned? We are locked out.
[08:51:16] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:51:16] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:52:19] !log Upgrade Mysql on db1140 (checked that all snapshots backups are done) - T226358
[08:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:24] T226358: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358
[08:52:56] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@mysql-m4-master-00 eventlogging-consumer@mysql-eventbus https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging
[08:54:48] in theory I should have downtimed it
[08:55:50] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging
[08:56:00] !log re-enable eventlogging mysql consumers after maintenance on eventlog1002
[08:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:01] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Increased failure rate of varnish be fetches - https://phabricator.wikimedia.org/T226318 (Gilles) I suspect it's an issue at the Swift level, possibly a capacity problem with the added Thumbnail miss load. I'm seeing these e...
[09:04:50] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:05:10] (PS2) Bmansurov: Labs: enable QuickSurveys on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819)
[09:10:00] Operations, media-storage: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (Gilles)
[09:10:18] Operations, media-storage: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (Gilles) Config value to increase that timeout: https://github.com/openstack/swift/blob/4ee9545805f52ff0da5c56ab04abf6f053b31a50/etc/proxy-server.conf-sample#L148
[09:14:19] (PS1) Jcrespo: Revert "mariadb: Disable checks of database snapshots" [puppet] - https://gerrit.wikimedia.org/r/518656
[09:14:31] (PS2) Jcrespo: Revert "mariadb: Disable checks of database snapshots" [puppet] - https://gerrit.wikimedia.org/r/518656
[09:15:35] (CR) Jcrespo: [C: +1] "This should work now better, because it will be very unlikely 2 backups fail in a row (and they no longer chain-fail)." [puppet] - https://gerrit.wikimedia.org/r/518656 (owner: Jcrespo)
[09:15:46] (PS1) Gilles: Increase swift proxy connection timeout to 1s [puppet] - https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373)
[09:15:54] (CR) Marostegui: [C: +1] Revert "mariadb: Disable checks of database snapshots" [puppet] - https://gerrit.wikimedia.org/r/518656 (owner: Jcrespo)
[09:22:53] (CR) Gehel: "Adding Eric as suggested."
[cookbooks] - https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: Mathew.onipe)
[09:23:04] !log reboot of kafka-jumbo100[1-6] for kernel + openjdk upgrades
[09:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:55] (CR) Jcrespo: [C: +2] Revert "mariadb: Disable checks of database snapshots" [puppet] - https://gerrit.wikimedia.org/r/518656 (owner: Jcrespo)
[09:40:32] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:41:20] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:43:05] Operations, Wikimedia Australia, Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (Aklapper) p:High→Normal This has not become more urgent and the Priority field is supposed to [reflect reality and does not caus...
[09:44:50] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (ema) p:Triage→Normal
[09:47:41] Operations, Wikimedia Australia, Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (Samwilson) I think it may have become more urgent because new members have joined the chapter and it's not possible for the list admins...
[09:49:32] 10Operations, 10Traffic: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ema) [09:49:40] 10Operations, 10Traffic: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ema) p:05Triage→03Normal [09:50:38] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, 10Thumbor: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10ema) >>! In T226318#5276594, @TheDJ wrote: > We couldn't find anything, but there ar... [10:12:26] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10jbond) I have manually disabled the `/usr/local/sbin/smart-data-dump` cron job to reduce spam [10:13:28] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:14:08] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:16:08] 10Operations, 10Traffic: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10Gilles) I believe the cause is {T226373} The increase is simply proportional to the thumbnail miss increase due to extra WebP requests. [10:19:57] 10Operations, 10ops-eqiad, 10DC-Ops: Hardware Request: puppet master eqiad - https://phabricator.wikimedia.org/T226382 (10jbond) p:05Triage→03Normal [10:21:28] 10Operations, 10ops-eqiad, 10DC-Ops: Hardware Request: puppet master eqiad - https://phabricator.wikimedia.org/T226382 (10MoritzMuehlenhoff) Once T201342 is done, it seems like the best candidate for this. 
[10:21:38] (03CR) 10Volans: "If I'm not mistaken the underlying issue has been solved and this patch is not needed anymore and could be abandoned." [dns] - 10https://gerrit.wikimedia.org/r/508979 (owner: 10Cwek) [10:28:23] !log re-enabling TCP SACKs on cp5001-cp5003 (half of Varnish/upload in eqsin) T225998 [10:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:28] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [10:29:59] (03CR) 10Lucas Werkmeister (WMDE): "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518239 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [10:30:05] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1030). [10:37:08] !log re-enabling TCP SACKs on cp5007-cp5009 (half of Varnish/text in eqsin) T225998 [10:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:09] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [10:38:19] (03CR) 10Elukey: [C: 03+2] Add hiera overrides for analytics1035 [puppet] - 10https://gerrit.wikimedia.org/r/518662 (owner: 10Elukey) [10:42:24] PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw+prometheus/ops [10:48:01] (03PS1) 10Giuseppe Lavagetto: safe-service-restart: logging improvements [puppet] - 10https://gerrit.wikimedia.org/r/518664 (https://phabricator.wikimedia.org/T224857) [10:48:03] (03PS1) 10Giuseppe Lavagetto: conftool: add safe_service_restart define [puppet] - 10https://gerrit.wikimedia.org/r/518665 [10:48:05] (03PS1) 
10Giuseppe Lavagetto: profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 [10:48:07] (03PS1) 10Giuseppe Lavagetto: mediawiki: use safe-restart scripts on all appserver, apis [puppet] - 10https://gerrit.wikimedia.org/r/518667 [10:48:09] (03PS1) 10Giuseppe Lavagetto: docker-registry: use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518668 [10:48:11] (03PS1) 10Giuseppe Lavagetto: parsoid: use safe service restarts [puppet] - 10https://gerrit.wikimedia.org/r/518669 [10:48:13] (03PS1) 10Giuseppe Lavagetto: eventbus: use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518670 [10:48:15] (03PS1) 10Giuseppe Lavagetto: eventschemas: use safe service restart script [puppet] - 10https://gerrit.wikimedia.org/r/518671 [10:48:17] (03PS1) 10Giuseppe Lavagetto: profile::lvs::realserver: only use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518672 [10:49:05] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: logging improvements [puppet] - 10https://gerrit.wikimedia.org/r/518664 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [10:50:24] PROBLEM - puppet last run on db1115 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [10:56:56] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.85% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1100). [11:00:05] bmansurov: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] here [11:02:18] Amir1: Lucas_WMDE awight Urbanecm anyone here? [11:02:50] (03PS1) 10Elukey: Add hiera overrides to analytics1036 [puppet] - 10https://gerrit.wikimedia.org/r/518675 [11:03:26] (03CR) 10Elukey: [C: 03+2] Add hiera overrides to analytics1036 [puppet] - 10https://gerrit.wikimedia.org/r/518675 (owner: 10Elukey) [11:03:44] Can anyone deploy my labs config patch? [11:05:26] o/ [11:05:31] I’m here, sorry for the delay [11:05:36] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:05:37] looking at the patch now [11:08:27] (03CR) 10Lucas Werkmeister (WMDE): Labs: enable QuickSurveys on hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov) [11:08:46] bmansurov: left a comment [11:11:27] (03CR) 10Lucas Werkmeister (WMDE): Labs: enable QuickSurveys on hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov) [11:14:47] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10aborrero) Worth noting that even though we will be using 10G links, we don't expect them to be fully used **in any case** in the short term.... 
[11:17:32] RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:20:07] (03CR) 10Bmansurov: Labs: enable QuickSurveys on hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov) [11:23:04] (03PS3) 10Bmansurov: Labs: enable QuickSurveys on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) [11:23:09] Lucas_WMDE: uploaded a new patch [11:23:15] ok [11:24:31] (03PS4) 10Lucas Werkmeister (WMDE): Labs: enable QuickSurveys on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov) [11:24:42] made one more quick change to add a comment with the task, hope that’s okay [11:25:46] and I suppose that’s not really a change that can be tested? all I can check is that the wikis still work after syncing it [11:25:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov) [11:26:27] Lucas_WMDE: yes, that's great. And yes, I'll check back later when the change goes live. [11:26:44] (03Merged) 10jenkins-bot: Labs: enable QuickSurveys on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov) [11:26:47] ok [11:26:58] (03CR) 10jenkins-bot: Labs: enable QuickSurveys on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov) [11:27:08] it’s on mwdebug1002, checking just a bit… [11:27:55] all seems to work [11:27:58] kafka-jumbo1005 was stuck after reboot, I just powercycled it. This may cause some alerts to fire [11:29:21] Lucas_WMDE: thanks! 
[11:29:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 98 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [11:29:58] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 91 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [11:30:10] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:518186|Labs: enable QuickSurveys on hewiki (T225819)]] (duration: 00m 57s) [11:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:15] T225819: Test demographics survey on beta wikis - https://phabricator.wikimedia.org/T225819 [11:30:30] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 76 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [11:30:32] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 57 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [11:30:54] nothing else in the deployment calendar for this slot [11:31:39] !log EU SWAT done [11:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:03] kafka-jumbo1005 is now running, alerts should clear out in a bit [11:32:57] bmansurov: and QuickSurveys seems to be enabled at https://he.wikipedia.beta.wmflabs.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%92%D7%A8%D7%A1%D7%94#mw-version-ext-other-QuickSurveys [11:34:57] Lucas_WMDE: awesome [11:35:19] 
though the force-enable links mentioned in the task comments didn’t seem to work for me [11:35:28] but I’ll leave that to y’all :) [11:35:38] * Lucas_WMDE knows nothing about QuickSurveys [11:35:59] Lucas_WMDE: it's working for me. Maybe you have DoNotTrack enabled? [11:36:04] Here's the working link: https://he.wikipedia.beta.wmflabs.org/wiki/%D7%91%D7%A0%D7%93%D7%99%D7%A7%D7%98%D7%95%D7%A1_%D7%9E%D7%A0%D7%95%D7%A8%D7%A1%D7%99%D7%94?quicksurvey=true [11:36:08] I think I do [11:36:13] it respects that? nice [11:36:17] yes [11:37:04] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [11:37:12] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [11:37:42] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [11:37:44] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [11:37:58] gooood [11:38:03] I have only one broker left [11:38:25] (03CR) 10Gilles: New library to interact with poolcounter from python (034 comments) [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [11:54:00] 10Operations, 10ops-eqiad, 10Cloud-Services, 
10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10ayounsi) That sounds reasonable for the PoC, depending on rack space. @faidon for the last word. Note that we don't have visibility in the c... [12:04:03] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:04:35] that paged [12:04:46] yep [12:05:20] gehel: are you around? [12:05:21] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:08:12] gehel told me on hangouts that he is looking into it but IRC seems not working well for him [12:08:45] very interesting metrics in https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-3h&to=now [12:09:53] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:14:15] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10aborrero) >>! In T224188#5278081, @ayounsi wrote: > That sounds reasonable for the PoC, depending on rack space. @faidon for the last word. >... [12:14:32] I did not get a page, should I have? 
[12:18:38] (03PS1) 10Effie Mouzeli: annual.wikimedia.org: redirect to AnnualReport2018 [puppet] - 10https://gerrit.wikimedia.org/r/518690 (https://phabricator.wikimedia.org/T226066) [12:19:57] <_joe_> apergos: yes you should've [12:20:25] I'll try powercycling the phone [12:20:37] maybe they'll show up then [12:21:21] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.840 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:23:04] still nothing [12:30:26] (03PS1) 10CDanis: wdqs: temporarily ban a likely-problematic UA [puppet] - 10https://gerrit.wikimedia.org/r/518691 [12:32:45] (03PS1) 10Jbond: wdqs: temp ban user agent [puppet] - 10https://gerrit.wikimedia.org/r/518696 [12:33:17] (03CR) 10Effie Mouzeli: [C: 03+1] wdqs: temporarily ban a likely-problematic UA [puppet] - 10https://gerrit.wikimedia.org/r/518691 (owner: 10CDanis) [12:33:36] (03CR) 10Gehel: [C: 03+1] wdqs: temporarily ban a likely-problematic UA [puppet] - 10https://gerrit.wikimedia.org/r/518691 (owner: 10CDanis) [12:34:00] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) [12:36:24] (03CR) 10Jbond: wdqs: temporarily ban a likely-problematic UA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518691 (owner: 10CDanis) [12:36:34] (03CR) 10Gehel: [C: 03+2] wdqs: temporarily ban a likely-problematic UA [puppet] - 10https://gerrit.wikimedia.org/r/518691 (owner: 10CDanis) [12:40:07] (03CR) 10Ema: [C: 04-1] wdqs: temp ban user agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518696 (owner: 10Jbond) [12:41:50] (03PS2) 10Jbond: wdqs: temp ban user agent [puppet] - 10https://gerrit.wikimedia.org/r/518696 [12:43:52] (03CR) 10Jbond: wdqs: temporarily ban a likely-problematic UA (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/518691 (owner: 10CDanis) [12:44:22] (03PS1) 10Giuseppe Lavagetto: 
lvs::configuration: use a meaningful request to monitor wdqs [puppet] - 10https://gerrit.wikimedia.org/r/518705 [12:45:13] (03CR) 10CDanis: [C: 03+1] lvs::configuration: use a meaningful request to monitor wdqs [puppet] - 10https://gerrit.wikimedia.org/r/518705 (owner: 10Giuseppe Lavagetto) [12:45:15] (03CR) 10Gehel: [C: 03+1] "LGTM, we should have been using this all along!" [puppet] - 10https://gerrit.wikimedia.org/r/518705 (owner: 10Giuseppe Lavagetto) [12:47:59] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) @Bstorm when talking about the private network for those hosts, are you referring to the private1 network or the lab private networ... [12:49:18] !log restarting blazegraph on wdqs1004 (JVM thread out of control) [12:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:00] (03PS7) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [12:52:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:52:46] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul) @ayounsi let me know when this week you have time for us to replace the old msw. Thanks. 
[12:54:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [12:55:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:57:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:57:22] (03PS8) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [12:59:15] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:00:17] !log shutdown wdqs updater on wdqs/public/eqiad [13:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:01:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:01:22] (03PS9) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 
10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [13:02:01] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.341 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:03:27] 10Operations, 10Traffic: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ema) [13:04:13] !log re-enabling TCP SACKs on cp1075-1082 (half of Varnish/text and Varnish/upload in eqiad) T225998 [13:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:18] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [13:06:52] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:54] jouncebot: now [13:06:55] No deployments scheduled for the next 3 hour(s) and 53 minute(s) [13:06:56] jouncebot: next [13:06:56] In 3 hour(s) and 53 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1700) [13:08:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:10:37] !log re-enabling TCP SACKs on cp2001,2002,2004-2008,2010,2011, 2014, 2017 (half of Varnish/text and Varnish/upload in codfw) T225998 [13:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:43] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [13:12:37] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [13:13:10] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw+prometheus/ops [13:16:57] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10CDanis) [13:19:03] !log re-enabling TCP SACKs on cp3040-cp3047, cp3049 (half of Varnish/text and Varnish/upload in esams) T225998 [13:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:08] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [13:20:46] 10Operations, 10Traffic: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ema) [13:23:24] (03CR) 10Volans: [C: 03+1] "> Patch Set 3: Code-Review+2" [software/conftool] - 10https://gerrit.wikimedia.org/r/515323 (owner: 10CDanis) [13:24:48] PHP fatal error: [13:24:48] entire web request took longer than 60 seconds and timed out [13:24:56] https://meta.wikimedia.org/wiki/Special:Log/Maintenance_script [13:25:48] (03PS4) 10Ema: Honor first_byte_timeout for recycled backend connections [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/518064 (https://phabricator.wikimedia.org/T226375) [13:25:50] (03PS2) 10Ema: varnish (5.1.3-1wm11) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/518224 (https://phabricator.wikimedia.org/T226375) [13:25:57] !log update libvirt on cloudvirt* stretch servers [13:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:25] !log re-enabling TCP SACKs on cp4024-4029 (half of Varnish/text and Varnish/upload in ulsfo) T225998 [13:26:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:30] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [13:27:25] 10Operations, 10Traffic, 10Patch-For-Review: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ema) [13:31:48] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10Volans) ` volans@re0.cr1-eqiad> show interfaces diagnostics optics xe-4/2/0 Physical interface: xe-4/2/0 Laser bias current : 39.156 mA Laser output power... [13:32:36] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10Volans) ` volans@re0.cr1-codfw> show interfaces diagnostics optics xe-5/2/1 Physical interface: xe-5/2/1 Laser bias current : 40.898 mA Laser output power... [13:33:01] (03PS5) 10Reedy: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) [13:33:05] a major outage. 
sigh [13:33:09] (telia) [13:36:07] 10Operations, 10Traffic, 10Patch-For-Review: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ema) [13:36:52] (03CR) 10Ema: [C: 03+2] Honor first_byte_timeout for recycled backend connections [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/518064 (https://phabricator.wikimedia.org/T226375) (owner: 10Ema) [13:36:55] ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP CDanis Telia major outage https://phabricator.wikimedia.org/T226394 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:36:55] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia major outage https://phabricator.wikimedia.org/T226394 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:37:09] (03CR) 10Ema: [C: 03+2] varnish (5.1.3-1wm11) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/518224 (https://phabricator.wikimedia.org/T226375) (owner: 10Ema) [13:37:48] (03PS6) 10Reedy: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) [13:39:11] (03Abandoned) 10Muehlenhoff: kerberos: add script to generate service principals/keytabs [puppet] - 10https://gerrit.wikimedia.org/r/470566 (https://phabricator.wikimedia.org/T212257) (owner: 10Muehlenhoff) [13:40:49] (03PS7) 10Reedy: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) [13:41:03] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10CDanis) Telia reports a 'major outage' and is tracking status of our circuit in case 00993514 [13:42:20] (03PS8) 
10Reedy: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) [13:43:42] (03PS9) 10Reedy: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) [13:44:13] 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) [13:44:43] 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) p:05Triage→03Normal [13:45:09] (03PS10) 10Reedy: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) [13:45:14] (03CR) 10Reedy: [C: 03+2] Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) (owner: 10Reedy) [13:46:07] (03Merged) 10jenkins-bot: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) (owner: 10Reedy) [13:46:39] (03CR) 10jenkins-bot: Move all FR config to an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518396 (https://phabricator.wikimedia.org/T225144) (owner: 10Reedy) [13:47:51] (03PS2) 10Giuseppe Lavagetto: safe-service-restart: logging improvements [puppet] - 10https://gerrit.wikimedia.org/r/518664 (https://phabricator.wikimedia.org/T224857) [13:48:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] safe-service-restart: logging improvements [puppet] - 10https://gerrit.wikimedia.org/r/518664 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [13:49:17] NOW I got the wdqs page. 
great [13:50:25] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) Have you had a chance to speak about this //"rebase hell"// during the SRE offsite? I think I answered to all the concerns that have been raised so far. Should I then j... [13:50:54] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: T225144 T225276 T225414 T225776 T225797 T226054 (duration: 00m 56s) [13:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:05] T225144: Flagged Revs configuration may be broken - https://phabricator.wikimedia.org/T225144 [13:51:05] T225414: File namespace added to FlaggedRevs common configuration - https://phabricator.wikimedia.org/T225414 [13:51:05] T226054: FlaggedRevs disabled for NS "Wikisource" and "File" but it still requiring reviews for transcludes - https://phabricator.wikimedia.org/T226054 [13:51:06] T225776: FlaggedRevs: Disable sysop protection level on the English Wikipedia - https://phabricator.wikimedia.org/T225776 [13:51:06] T225276: FlaggedRevs (statistics) first three namespaces are listed twice - https://phabricator.wikimedia.org/T225276 [13:51:06] T225797: Remove FlaggedRevs-related user groups from all FlaggedRevs wikis that shouldn't have them - https://phabricator.wikimedia.org/T225797 [13:51:39] !log rolling restart of the conf servers starting in 10 minutes please let me know if you foresee any issue [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:58] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10jijiki) [13:53:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool: add safe_service_restart define [puppet] - 10https://gerrit.wikimedia.org/r/518665 (owner: 10Giuseppe Lavagetto) [14:01:29] 10Operations, 10Traffic, 
10Patch-For-Review: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ema) [14:01:57] !log cp3032: upgrade varnish to 5.1.3-1wm11 T226375 [14:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:02] T226375: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 [14:06:26] 10Operations, 10Cognate, 10DBA, 10Growth-Team, and 2 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) [14:07:58] Reedy: Items in Special:Contributions on en.wiki are getting highlighted as flaggedrevs-unreviewed despite not having flagged revisions enabled. I assume it's related to the above deploy [14:08:00] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Weird_highlighting_of_some_of_my_contributions [14:08:24] Probably [14:08:29] Let's have a look [14:10:17] (03PS1) 10MarcoAurelio: Close wikimania2018.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) [14:11:23] pcoombe: Looking at the config though... [14:11:33] NS_MAIN has always potentially been subject to FR on enwiki [14:14:34] pcoombe: But FR is enabled [14:15:13] But it is only enabled for certain articles (https://en.wikipedia.org/wiki/Wikipedia:Pending_changes) This is showing up for changes on all articles [14:15:19] (03CR) 10MarcoAurelio: "Scheduled for July 1st as indicated in the task and notified to the people. 
I've found no instances on IS.php/CS.php that needs to be remo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio) [14:15:36] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:15:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:24] (03PS2) 10Giuseppe Lavagetto: profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 [14:17:24] } elseif ( !isset( $row->fp_stable ) ) { [14:17:24] $classes[] = 'flaggedrevs-unreviewed'; [14:17:24] } [14:19:21] (03PS1) 10Muehlenhoff: Remove access for cwdent [puppet] - 10https://gerrit.wikimedia.org/r/518722 [14:19:22] PROBLEM - PyBal connections to etcd on lvs1014 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:20:16] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Bstorm) >>! In T224188#5277181, @ayounsi wrote: > Note that LibreNMS have a 5min granularity. That mean if a sudden spike of traffic appear,... 
[14:22:32] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 66 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [14:23:36] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for cwdent [puppet] - 10https://gerrit.wikimedia.org/r/518722 (owner: 10Muehlenhoff) [14:26:19] pcoombe: lol, well, https://en.wikipedia.org/w/index.php?diff=903244977 fixes it [14:26:45] 10Operations, 10netops: Remove access to network gear for Casey Dentinger - https://phabricator.wikimedia.org/T226405 (10MoritzMuehlenhoff) [14:26:56] PROBLEM - Check systemd state on conf1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:27:41] mutante: can your wikitech block (done by Ariel) be lifted or do you still need it for testing? [14:27:45] hi btw [14:28:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but fix the whitespace. We should have a followup to remove annualreport completely as a separate host I guess, given we're not hosti" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518690 (https://phabricator.wikimedia.org/T226066) (owner: 10Effie Mouzeli) [14:29:47] (03PS3) 10Giuseppe Lavagetto: profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 [14:31:13] hauskatze: I think he's still off sick atm [14:31:26] oh, sad to learn that, okay [14:33:00] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10Papaul) a:05Papaul→03Marostegui Disk replaced [14:33:08] (03PS2) 10Effie Mouzeli: annual.wikimedia.org: redirect to AnnualReport2018 [puppet] - 10https://gerrit.wikimedia.org/r/518690 (https://phabricator.wikimedia.org/T226066) [14:33:21] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10jijiki) p:05Triage→03Unbreak! 
[14:33:51] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10jijiki) Triaged as UBN! even thought it is not something we can control [14:34:12] 10Operations, 10media-storage, 10Patch-For-Review: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) p:05Triage→03High [14:34:34] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Image thumbnail (cache?) broken on English Wikipedia, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (10jijiki) p:05Triage→03Normal [14:35:07] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226213 (10jijiki) [14:35:44] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226213 (10jijiki) p:05Triage→03High [14:35:47] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10Marostegui) a:05Marostegui→03Papaul It failed already :( ` physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Failed) ` [14:37:01] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:37:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:18] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226213 (10elukey) Same thing as the other time, this is the testing cluster.. I did a round of reboots and disks started to fail, this one can be safely ignored since we don't use those disks on that h... 
[14:39:18] (03PS2) 10Kosta Harlan: Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) [14:39:44] (03PS1) 10Reedy: Further partial revert of copy paste of config into code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518724 [14:42:10] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: Replace cr[1-2].codfw fan filters - https://phabricator.wikimedia.org/T226407 (10Papaul) [14:42:21] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: Replace cr[1-2].codfw fan filters - https://phabricator.wikimedia.org/T226407 (10Papaul) p:05Triage→03Normal [14:43:16] 10Operations, 10Wikimedia-Mailing-lists: Request mailing list Chad - https://phabricator.wikimedia.org/T225240 (10jijiki) @Abdallahbigboy it would be great if you could provide some more info about the group before we move to creating a mailing list [14:43:19] pcoombe: I think I know why [14:43:22] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:43:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:10] PROBLEM - Check systemd state on conf1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:48:41] ^^ checking [14:49:02] (03PS1) 10Volans: Use Python3 syntax for super() calls [software/conftool] - 10https://gerrit.wikimedia.org/r/518726 [14:49:04] (03PS1) 10Volans: Use Python3 syntax for class definition [software/conftool] - 10https://gerrit.wikimedia.org/r/518727 [14:51:25] (03CR) 10Effie Mouzeli: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518690 (https://phabricator.wikimedia.org/T226066) (owner: 10Effie Mouzeli) [14:51:43] (03CR) 10Effie Mouzeli: [C: 03+2] annual.wikimedia.org: redirect to AnnualReport2018 [puppet] - 10https://gerrit.wikimedia.org/r/518690 (https://phabricator.wikimedia.org/T226066) (owner: 10Effie Mouzeli) [14:51:57] (03PS3) 10Effie Mouzeli: annual.wikimedia.org: redirect to AnnualReport2018 [puppet] - 10https://gerrit.wikimedia.org/r/518690 (https://phabricator.wikimedia.org/T226066) [14:54:24] RECOVERY - Check systemd state on conf1004 is OK: OK - running: The system is fully operational [14:55:16] Reedy: thanks for responding there! 
I'm pretty clueless about the flaggedrevs configuration myself, just noticed the added classes [14:57:18] RECOVERY - Check systemd state on conf1006 is OK: OK - running: The system is fully operational [14:59:26] 10Operations, 10Annual-Report, 10serviceops, 10Patch-For-Review: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (10jijiki) 05Open→03Resolved a:03jijiki @LTraer please let me know if there are any issues [15:01:00] PROBLEM - PyBal connections to etcd on lvs3004 is CRITICAL: CRITICAL: 2 connections established with conf1006.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [15:01:30] PROBLEM - PyBal connections to etcd on lvs3001 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:04:31] (03PS3) 10Mforns: analytics::refinery::job::data_purge add deletion for data_quality_hourly [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) [15:05:54] !log re-enabling wdqs updater on wdqs-public / eqiad [15:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:56] PROBLEM - Check systemd state on conf1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:12:21] 10Operations, 10observability: Upgrade grafana to 6.x - https://phabricator.wikimedia.org/T220838 (10CDanis) [15:14:41] 10Operations, 10Wikimedia-Mailing-lists: Request mailing list Chad - https://phabricator.wikimedia.org/T225240 (10Abdallahbigboy) Hello @jijiki Our user group was recognized on May 17, 2019. We wish to have a mailing list to better organize our communication. If you need more information, let me know. https://... 
[15:15:35] (03CR) 10Reedy: [C: 03+2] Further partial revert of copy paste of config into code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518724 (owner: 10Reedy) [15:16:12] RECOVERY - Check systemd state on conf1004 is OK: OK - running: The system is fully operational [15:16:28] (03Merged) 10jenkins-bot: Further partial revert of copy paste of config into code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518724 (owner: 10Reedy) [15:16:43] (03CR) 10jenkins-bot: Further partial revert of copy paste of config into code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518724 (owner: 10Reedy) [15:17:44] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Remove some unnecessary copy pasted code (duration: 00m 55s) [15:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:24] 10Operations, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Halfak) This is related to {T225956}. They should have been merged into one ticket. [15:19:03] (03PS2) 10Alexandros Kosiaris: DHCP: Add MAC address entries for ganeti2009 and ganeti201[0-8] [puppet] - 10https://gerrit.wikimedia.org/r/518303 (https://phabricator.wikimedia.org/T224603) (owner: 10Papaul) [15:19:08] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] DHCP: Add MAC address entries for ganeti2009 and ganeti201[0-8] [puppet] - 10https://gerrit.wikimedia.org/r/518303 (https://phabricator.wikimedia.org/T224603) (owner: 10Papaul) [15:19:30] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10jijiki) @Arrbee I am a little confused. If @Jpita is a wmf employee, they can be added to the `wmf` group which provides access to logstash, this also implies that the user already exists... 
[15:20:08] (03PS2) 10Alexandros Kosiaris: Partman: Add ganeti201[0-8] [puppet] - 10https://gerrit.wikimedia.org/r/518305 (https://phabricator.wikimedia.org/T224603) (owner: 10Papaul) [15:21:03] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Pchelolo) I do not have a solid theory of what's happening yet, but just throwing in some bits and pieces. Regarding the `LocalRenameUserJob` for `TRX... [15:21:58] (03PS3) 10Alexandros Kosiaris: Partman: Add ganeti201[0-8] [puppet] - 10https://gerrit.wikimedia.org/r/518305 (https://phabricator.wikimedia.org/T224603) (owner: 10Papaul) [15:22:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] Partman: Add ganeti201[0-8] [puppet] - 10https://gerrit.wikimedia.org/r/518305 (https://phabricator.wikimedia.org/T224603) (owner: 10Papaul) [15:22:53] 10Operations, 10SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (10jijiki) 05Open→03Resolved a:03jijiki @Iflorez I am marking this as resolved, follow up on -sre of -operations and we will reopen if more actions are needed from our end [15:25:47] (03PS1) 10Reedy: Drag non array values out of callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518745 [15:26:06] (03PS2) 10Reedy: Drag non array values out of callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518745 [15:26:10] (03CR) 10Reedy: [C: 03+2] Drag non array values out of callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518745 (owner: 10Reedy) [15:26:24] PROBLEM - Check systemd state on conf1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:26:26] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:27:00] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:27:09] (03Merged) 10jenkins-bot: Drag non array values out of callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518745 (owner: 10Reedy) [15:27:41] (03CR) 10CDanis: [C: 03+2] Use Python3 syntax for class definition [software/conftool] - 10https://gerrit.wikimedia.org/r/518727 (owner: 10Volans) [15:28:14] (03CR) 10jenkins-bot: Drag non array values out of callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518745 (owner: 10Reedy) [15:28:24] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Simple config outside callback (duration: 00m 56s) [15:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:46] RECOVERY - PyBal connections to etcd on lvs3001 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:29:08] 10Operations, 10SRE-Access-Requests, 10observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10jijiki) 05Stalled→03Resolved a:03jijiki @Tonycepo I am marking as this as resolved, please reopen if there are reasons you would be needing this kind of access [15:29:34] 10Operations, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Halfak) [15:29:51] (03PS1) 10Andrew Bogott: wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 
(https://phabricator.wikimedia.org/T226415) [15:30:13] * gehel is looking at cirrus update rate [15:30:16] (03CR) 10CDanis: [C: 03+2] Use Python3 syntax for super() calls [software/conftool] - 10https://gerrit.wikimedia.org/r/518726 (owner: 10Volans) [15:30:20] RECOVERY - PyBal connections to etcd on lvs1014 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:30:44] RECOVERY - Check systemd state on conf1004 is OK: OK - running: The system is fully operational [15:30:57] (03CR) 10Andrew Bogott: [C: 04-1] "This doesn't work -- running 'virsh undefine' as user 'nova' just errors out." [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [15:31:24] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: Replace cr[1-2].codfw fan filters - https://phabricator.wikimedia.org/T226407 (10ayounsi) Feel free to do it anytime. Doc is on https://www.juniper.net/documentation/en_US/release-independent/junos/topics/topic-map/mx480-maintain-cooling-system.html [15:31:33] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10CDanis) p:05Unbreak!→03High it's just one (not-often-used) link down, not a site down; UBN is unnecessary IMO [15:32:27] 10Operations, 10serviceops: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10jijiki) p:05High→03Normal [15:32:35] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10ayounsi) a:03ayounsi [15:32:44] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 67 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [15:32:46] RECOVERY - PyBal connections to etcd on lvs3004 is OK: OK: 8 connections established with conf1006.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [15:32:54] 
(03Merged) 10jenkins-bot: Use Python3 syntax for super() calls [software/conftool] - 10https://gerrit.wikimedia.org/r/518726 (owner: 10Volans) [15:32:56] (03Merged) 10jenkins-bot: Use Python3 syntax for class definition [software/conftool] - 10https://gerrit.wikimedia.org/r/518727 (owner: 10Volans) [15:34:56] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:34:56] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:05] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:35:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:47] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:39:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:56] pcoombe: And there's always one... :) https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&diff=next&oldid=903254337 [15:40:39] hehe [15:41:36] Reedy: ha! Thanks for the quick fix! 
[15:42:20] PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:42:37] ^^ on it [15:43:34] PROBLEM - PyBal connections to etcd on lvs2002 is CRITICAL: CRITICAL: 2 connections established with conf2001.codfw.wmnet:2379 (min=10) https://wikitech.wikimedia.org/wiki/PyBal [15:44:32] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:44:50] ACKNOWLEDGEMENT - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] Gehel transient failure, recovering already https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:44:58] RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 4 connections established with conf2001.codfw.wmnet:2379 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [15:45:24] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:45:30] PROBLEM - Check systemd state on conf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:47:36] (03PS5) 10CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [15:48:06] !log remove cwdent from all network devices - T226405 [15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:11] T226405: Remove access to network gear for Casey Dentinger - https://phabricator.wikimedia.org/T226405 [15:48:22] 10Operations, 10Operations-Software-Development, 10netbox, 10Patch-For-Review: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900 (10crusnov) Just closing the loop here, the backend is up for review, but there is apparently a pylint bug preventing CI from passing (or there was last w... [15:48:26] RECOVERY - Check systemd state on conf2001 is OK: OK - running: The system is fully operational [15:48:56] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:48:57] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:49:00] RECOVERY - PyBal connections to etcd on lvs2002 is OK: OK: 10 connections established with conf2001.codfw.wmnet:2379 (min=10) https://wikitech.wikimedia.org/wiki/PyBal [15:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:52] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:49:52] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/515058 (owner: 10Filippo Giunchedi) [15:50:35] volans: could the `START - Cookbook sre.hosts.downtime` message precise which hosts are being downtimed? 
[15:51:05] XioNoX: T221212 [15:51:08] T221212: spicerack/cookbook: add additional arguments IRC/SAL logging - https://phabricator.wikimedia.org/T221212 [15:51:30] there is a discussion in the related CR, I need to resume it and make some changes and see if I can get an agreement ;) [15:51:31] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Bstorm) Had a huddle with @jhedden, actually. He'll add his thoughts soon (with a some info from our existing monitoring). [15:51:56] yay [15:51:58] thx! [15:52:46] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [15:52:59] (03CR) 10jerkins-bot: [V: 04-1] Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [15:53:14] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: Replace cr[1-2].codfw fan filters - https://phabricator.wikimedia.org/T226407 (10Papaul) 05Open→03Resolved filter replaced on both routers [15:54:08] PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CRITICAL: 6 connections established with conf2003.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:55:16] (03PS1) 10Reedy: Remove some pointlessly copied comments from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518749 [15:56:40] 10Operations, 10cloud-services-team (Kanban): Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Bstorm) @Andrew this seems...fixed does it not? [15:56:42] (03PS1) 10Ppchelko: RunSingleJob: Don't silently fail if `page_title` is not provided in job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) [15:57:38] (03CR) 10jerkins-bot: [V: 04-1] RunSingleJob: Don't silently fail if `page_title` is not provided in job. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) (owner: 10Ppchelko) [15:57:57] (03PS2) 10Ppchelko: RunSingleJob: Don't silently fail if `page_title` is not provided in job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) [15:59:05] (03CR) 10Krinkle: RunSingleJob: Don't silently fail if `page_title` is not provided in job. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) (owner: 10Ppchelko) [15:59:14] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue, 10Patch-For-Review: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Pchelolo) The above patch to mediawiki-config will fix it. This is another instance of having code in multiple places biting us ha... [15:59:24] 10Operations, 10ops-codfw, 10netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (10RobH) p:05Triage→03Normal [15:59:38] 10Operations, 10ops-codfw, 10netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (10RobH) [16:00:14] PROBLEM - Check systemd state on conf2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:00:29] 10Operations, 10ops-eqiad, 10netops: update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10RobH) p:05Triage→03Normal [16:00:42] 10Operations, 10ops-eqiad, 10netops: update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10RobH) [16:02:53] (03PS3) 10Ppchelko: RunSingleJob: Don't silently fail if `page_title` is not provided in job. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) [16:02:58] 10Operations, 10SRE-Access-Requests, 10observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Aklapper) 05Resolved→03Declined a:05jijiki→03None Setting status to declined as no access was given. [16:03:08] RECOVERY - Check systemd state on conf2003 is OK: OK - running: The system is fully operational [16:03:33] (03CR) 10Ppchelko: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) (owner: 10Ppchelko) [16:03:40] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226213 (10mforns) 05Open→03Invalid So, closing then as invalid. [16:04:02] (03CR) 10Reedy: [C: 03+2] Remove some pointlessly copied comments from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518749 (owner: 10Reedy) [16:04:54] (03Merged) 10jenkins-bot: Remove some pointlessly copied comments from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518749 (owner: 10Reedy) [16:05:02] RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 12 connections established with conf2003.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [16:06:31] (03CR) 10jenkins-bot: Remove some pointlessly copied comments from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518749 (owner: 10Reedy) [16:06:48] 10Operations, 10cloud-services-team (Kanban): Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Andrew) 05Open→03Resolved a:03Andrew Haven't seen this in ages. [16:08:01] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: comments (duration: 00m 56s) [16:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:46] Reedy: ok to take over deploy1001? 
[16:08:52] yup :) [16:08:56] kk going [16:09:03] Pchelolo: ^ [16:09:07] (03CR) 10Mobrovac: [C: 03+2] RunSingleJob: Don't silently fail if `page_title` is not provided in job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) (owner: 10Ppchelko) [16:10:01] (03Merged) 10jenkins-bot: RunSingleJob: Don't silently fail if `page_title` is not provided in job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) (owner: 10Ppchelko) [16:10:17] (03CR) 10jenkins-bot: RunSingleJob: Don't silently fail if `page_title` is not provided in job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518751 (https://phabricator.wikimedia.org/T226109) (owner: 10Ppchelko) [16:10:27] going for the deploy [16:13:03] !log mobrovac@deploy1001 Synchronized rpc/RunSingleJob.php: RunSingleJob: check that only the database param is set and leave the rest to JobExecutor - T226109 (duration: 00m 55s) [16:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:08] T226109: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 [16:13:26] ok Reedy Pchelolo deploy done, we should be good now [16:14:27] I'll close the ticket [16:14:42] thnx Pchelolo [16:16:12] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Core Platform Team Kanban (Done with CPT): Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Pchelolo) 05Open→03Resolved a:03Pchelolo The fix to mw-config has been deployed, we should be good now. [16:21:00] (03PS1) 10Reedy: Remove simple config from callback function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518754 [16:22:01] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10RobH) Are the disks being installed (and then failing) new disks or old decom disks? 
If new spare disks are failing, we need to return them for replacement. Please let me know! [16:22:21] (03CR) 10Reedy: [C: 03+2] Remove simple config from callback function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518754 (owner: 10Reedy) [16:23:54] (03PS10) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [16:24:22] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10Papaul) Those are decom disks and not new disks. We have no more new disks on site. [16:24:26] (03Merged) 10jenkins-bot: Remove simple config from callback function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518754 (owner: 10Reedy) [16:25:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:25:42] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Remove some duplicated config (duration: 00m 55s) [16:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:20] (03CR) 10jenkins-bot: Remove simple config from callback function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518754 (owner: 10Reedy) [16:32:05] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10ema) We believe that Varnish fetch failures might be related to this issue, investigation is ongoing T2... 
[16:33:05] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10mobrovac) [16:33:21] (03CR) 10Ottomata: "+1 from me, but one question I don't know the answer too that might need to be considered in your migration plan." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [16:33:36] (03CR) 10Ottomata: "to*" [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [16:38:02] akosiaris _joe_ : ottomata raises a good question question on https://gerrit.wikimedia.org/r/514361 above. do you know? [16:38:28] (03CR) 10Ottomata: [C: 03+1] "> We can also simply cease the whole cdh submodule approach and fold it into the main puppet." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:39:44] (03CR) 10Urbanecm: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio) [16:42:01] (03CR) 10Nuria: [C: 03+1] "Looks good, keeping quality data for now for 90 days, we can extend intervals as needed be." 
[puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [16:45:01] (03CR) 10Elukey: "> > We can also simply cease the whole cdh submodule approach and" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:47:14] (03CR) 10Elukey: "What I'd also like to create is a kerberos::exec define, and deprecate the usage of cdh::exec :)" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:51:54] (03CR) 10Ottomata: [C: 03+1] "I think moving cdh into puppet will make more people happy than unhappy, so I am turning on the green light. :o :D" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:52:06] (03CR) 10Ottomata: [C: 03+1] "#utilitarianism" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:52:10] (03CR) 10Elukey: analytics::refinery::job::data_purge add deletion for data_quality_hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [16:52:47] herron: yeah, for now it has to be applied manually for now. It's informational and about to be fully deprecated as it will be moved into helm charts (pending https://phabricator.wikimedia.org/T207804). I 've already applied them, so it should be ok [16:53:30] (03CR) 10Alexandros Kosiaris: kafka-main: replace kafka2003 hardware with kafka-main2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [16:54:45] akosiaris: awesome thank you! [16:55:26] (03CR) 10Muehlenhoff: "\o/ for merging into puppet.git!" 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:56:52] (03CR) 10Ottomata: [C: 03+1] "Great :)" [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [16:56:55] (03CR) 10Ema: "YEEEEEHHHH!!" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:58:32] (03PS1) 10Jgreen: remove cwdentinger from nagios contactgroups.cfg [puppet] - 10https://gerrit.wikimedia.org/r/518758 (https://phabricator.wikimedia.org/T226396) [16:58:43] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jijiki) [16:58:47] (03CR) 10Ottomata: [C: 03+1] "Before we do this." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:59:15] (03CR) 10Ottomata: [C: 03+1] "Most of those are forks of the old puppet-cdh4 version tho." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:59:25] (03CR) 10Jgreen: [V: 03+2 C: 03+2] remove cwdentinger from nagios contactgroups.cfg [puppet] - 10https://gerrit.wikimedia.org/r/518758 (https://phabricator.wikimedia.org/T226396) (owner: 10Jgreen) [17:00:04] gehel and onimisionipe: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1700). [17:00:42] jouncebot: WDQS deployment will be delayed today [17:01:53] 10Operations, 10netops: Remove access to network gear for Casey Dentinger - https://phabricator.wikimedia.org/T226405 (10ayounsi) 05Open→03Resolved User (set as read-only user) removed from all network devices. 
[17:02:05] (03PS1) 10Reedy: Fix AddGroups/Remove groups for editor/autoreview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) [17:02:28] (03PS2) 10Reedy: Fix AddGroups/RemoveGroups for editor/autoreview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) [17:04:23] (03CR) 10Reedy: [C: 03+2] Fix AddGroups/RemoveGroups for editor/autoreview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) (owner: 10Reedy) [17:05:35] (03CR) 10Herron: "Ok! I'll plan on proceeding with this tomorrow in the AM (Eastern)" [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [17:06:16] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: specify the port of etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/518761 (https://phabricator.wikimedia.org/T215531) [17:08:42] (03Merged) 10jenkins-bot: Fix AddGroups/RemoveGroups for editor/autoreview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) (owner: 10Reedy) [17:08:56] (03CR) 10jenkins-bot: Fix AddGroups/RemoveGroups for editor/autoreview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) (owner: 10Reedy) [17:09:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: specify the port of etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/518761 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [17:09:51] jouncebot: now [17:09:51] For the next 0 hour(s) and 20 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1700) [17:09:59] jouncebot: next [17:10:00] In 0 hour(s) and 50 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1800) 
[17:10:52] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: T226410 (duration: 00m 54s) [17:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:58] T226410: In de.wiktionary admins (again!!) no longer can grant the right for "autoreview" (passiver Sichter) or "editor" (Sichter). - https://phabricator.wikimedia.org/T226410 [17:16:06] (03PS1) 10Reedy: Remove three "is this needed" comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518763 [17:20:01] (03PS1) 10Elukey: Move the cdh submodule into environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518764 [17:20:51] (03PS1) 10Reedy: Remove a few more boolean values from the callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518765 [17:20:52] PROBLEM - puppet last run on analytics1072 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/hadoop/data/b/yarn/local] [17:21:10] (03CR) 10Reedy: [C: 03+2] Remove three "is this needed" comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518763 (owner: 10Reedy) [17:22:04] (03Merged) 10jenkins-bot: Remove three "is this needed" comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518763 (owner: 10Reedy) [17:22:14] (03PS2) 10Reedy: Remove a few more boolean values from the callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518765 [17:22:17] (03CR) 10Reedy: [C: 03+2] Remove a few more boolean values from the callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518765 (owner: 10Reedy) [17:22:19] (03CR) 10jenkins-bot: Remove three "is this needed" comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518763 (owner: 10Reedy) [17:23:13] (03Merged) 10jenkins-bot: Remove a few more boolean values from the callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518765 (owner: 10Reedy) [17:23:40] (03PS1) 10Reedy: Add moar /////////////////////////////////////// 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/518766 [17:23:50] (03CR) 10Reedy: [C: 03+2] Add moar /////////////////////////////////////// [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518766 (owner: 10Reedy) [17:24:07] (03CR) 10Ottomata: [C: 03+1] "> Patch Set 1:" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [17:24:13] (03CR) 10jenkins-bot: Remove a few more boolean values from the callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518765 (owner: 10Reedy) [17:24:22] Reedy: You having fun? twentyafterfour is wanting to deploy the train. :-) [17:24:31] The last one is fun, yes :P [17:24:34] PROBLEM - MegaRAID on analytics1072 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:24:43] ah snap [17:24:56] (03Merged) 10jenkins-bot: Add moar /////////////////////////////////////// [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518766 (owner: 10Reedy) [17:25:02] * twentyafterfour can wait [17:25:14] (03CR) 10jenkins-bot: Add moar /////////////////////////////////////// [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518766 (owner: 10Reedy) [17:25:20] twentyafterfour: Just deploying those three [17:25:25] Feel free to go [17:25:57] Next change is gonna take a bit longer to write up [17:26:05] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Cleanup (duration: 00m 55s) [17:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:34] (03PS1) 10Elukey: Add hiera overrides for analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/518767 [17:27:51] OK so the plan is to push 1.34.0-wmf.1 to group 1 and then let it bake for a while.
We may still deploy the branch to group 2 later today or perhaps wait until tomorrow depending on how things are looking. [17:28:11] I guess I should write that on the task.... [17:29:21] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: specify scheme:// for etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/518768 (https://phabricator.wikimedia.org/T215531) [17:29:29] wmf.10 not .1 [17:31:23] (03CR) 10Volans: "Did a first pass, some are questions/thinks to discuss, see inline." (039 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [17:31:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: specify scheme:// for etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/518768 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [17:32:20] twentyafterfour, will your train plan affect upcoming SWAT window somehow? :) [17:33:08] (03PS2) 10Elukey: Add hiera overrides for analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/518767 [17:33:34] (03PS3) 10Elukey: Add hiera overrides for analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/518767 [17:34:18] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17068/analytics1072.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/518767 (owner: 10Elukey) [17:37:58] (03PS1) 10Nuria: Keeping webrequest_sampled data for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/518769 (https://phabricator.wikimedia.org/T226227) [17:38:49] Urbanecm: it should not, I hope [17:38:55] cool [17:38:56] jouncebot: next [17:38:56] In 0 hour(s) and 21 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1800) [17:38:58] (03PS1) 10Ottomata: Refine page_links_change data using old Refine job [puppet] - 10https://gerrit.wikimedia.org/r/518770 (https://phabricator.wikimedia.org/T226268) [17:39:26] (03PS2) 10Nuria: Keeping 
webrequest_sampled data for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/518769 (https://phabricator.wikimedia.org/T226227) [17:40:04] 10Operations, 10Wikimedia-Mailing-lists: Request mailing list Chad - https://phabricator.wikimedia.org/T225240 (10jijiki) p:05Triage→03Normal [17:40:16] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jijiki) p:05Triage→03Normal [17:40:18] (03CR) 10Ottomata: [C: 03+1] "Shall I merge?" [puppet] - 10https://gerrit.wikimedia.org/r/518769 (https://phabricator.wikimedia.org/T226227) (owner: 10Nuria) [17:40:37] (03CR) 10Ottomata: [C: 03+2] Refine page_links_change data using old Refine job [puppet] - 10https://gerrit.wikimedia.org/r/518770 (https://phabricator.wikimedia.org/T226268) (owner: 10Ottomata) [17:40:40] (03PS1) 1020after4: group1 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518771 [17:40:42] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518771 (owner: 1020after4) [17:41:36] twentyafterfour: MW.org isn't on wmf.10 yet, BTW. :-) [17:41:40] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518771 (owner: 1020after4) [17:42:17] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518771 (owner: 1020after4) [17:42:18] James_F: hah good catch [17:42:57] (03CR) 10CRusnov: "Thank you for the review. Should be relatively straight forward to address these changes." (038 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [17:42:58] Details, details. 
[17:42:59] (03PS1) 1020after4: group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518772 [17:43:01] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518772 (owner: 1020after4) [17:43:09] (03CR) 10jerkins-bot: [V: 04-1] group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518772 (owner: 1020after4) [17:43:15] (03CR) 10jerkins-bot: [V: 04-1] group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518772 (owner: 1020after4) [17:43:51] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) Approved. [17:44:20] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Ottomata) @Halfak I'm assuming you want `statistics-privatedata-users` for stat1006 and `analytics-privatedata-users` group in order to access Hadoop etc, yes? [17:44:42] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@157f40c]: weekly WDQS deploy [17:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:08] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10jijiki) @Gehel @cdanis should we mark this as resolved? [17:45:10] (03CR) 10CRusnov: Add new dumpbackup.py script (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [17:45:43] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10Gehel) 05Open→03Resolved a:03Gehel no further issues seen, let's get this closed. 
[17:46:35] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Ottomata) Oh, there is no shell account for this user yet? I think that does need SRE approval. Please see: https://wikitech.wikimedia.org/wiki/Production_she... [17:47:30] (03PS1) 10Reedy: Move some FR config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518773 [17:47:32] (03PS1) 10Reedy: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 [17:48:21] (03CR) 10jerkins-bot: [V: 04-1] Move some FR config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518773 (owner: 10Reedy) [17:48:24] (03CR) 10jerkins-bot: [V: 04-1] Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 (owner: 10Reedy) [17:48:53] (03Abandoned) 1020after4: group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518772 (owner: 1020after4) [17:49:03] 18:48:17 1) InitialiseSettingsTest::testOnlyExistingWikis [17:49:03] 18:48:17 0 is referenced, but it isn't either a wiki or a dblist [17:49:03] 18:48:17 Failed asserting that false is true. [17:49:03] wat [17:49:26] huh? 
[17:49:42] (03PS2) 10Reedy: Move some FR config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518773 [17:49:47] I think it's my fault [17:50:06] (03PS1) 1020after4: group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518775 [17:50:09] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518775 (owner: 1020after4) [17:50:39] (03CR) 10Nuria: [C: 03+1] Refine page_links_change data using old Refine job [puppet] - 10https://gerrit.wikimedia.org/r/518770 (https://phabricator.wikimedia.org/T226268) (owner: 10Ottomata) [17:51:09] (03Merged) 10jenkins-bot: group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518775 (owner: 1020after4) [17:51:26] (03CR) 10jenkins-bot: group0 wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518775 (owner: 1020after4) [17:51:29] (03PS2) 10Reedy: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 [17:51:38] ok deploying [17:51:54] hopefully this will be done before SWAT [17:53:04] The full scap is already done last week, which is the slow bit. [17:53:05] (03PS3) 10Reedy: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 [17:53:12] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.10 refs T220735 [17:53:15] (He says, jinxing us all.) [17:55:21] twentyafterfour@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[17:55:22] T220735: 1.34.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T220735 [17:56:02] * Reedy raises an eyebrow [17:56:25] uh oh [17:56:55] bunch of errors on wmf.10 [17:57:31] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10Marostegui) 05Open→03Resolved The RAID finished correctly, although the disk came with predictive failure. I am going to close this task as resolved as the ops-monitoring will open a new once once it... [17:57:48] oh someone is running a maintenance script? [17:57:51] TypeError from line 27 of /srv/mediawiki/php-1.34.0-wmf.10/extensions/Wikibase/repo/includes/Store/Sql/SqlEntityIdPagerFactory.php: Argument 2 passed to Wikibase\Repo\Store\Sql\SqlEntityIdPagerFactory::__construct() must be an instance of Wikibase\Store\EntityIdLookup, instance of Wikibase\DataModel\Entity\DispatchingEntityIdParser given, called in [17:57:53] /srv/mediawiki/php-1.34.0-wmf.10/extensions/Wikibase/repo/maintenance/dumpJson.php on line 85 [17:58:07] I'm guessing that's just dumps? [17:58:08] apergos: ^ [17:58:09] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [17:58:48] ah yeah it's snapshot1008 [17:59:09] so wmf.10 breaks dumps? should I roll back or just file a task [18:00:02] hmm that spike of errors doesn't continue [18:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1800). Please do the needful. [18:00:04] Urbanecm: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] hmmm that's weird... 
I did some changes recently to SqlEntityIdPager, I wonder if something is related [18:00:15] let me check [18:00:25] let me know when i can start with above [18:01:11] Urbanecm: give me about 5 minutes [18:01:21] sure [18:01:47] (03PS1) 10Reedy: Remove comments about how to "install" FR, very out of date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518778 [18:01:48] I want to be sure that things are stable. It looks good so far other than that SqlEntityIdPagerFactory error [18:02:53] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@157f40c]: weekly WDQS deploy (duration: 18m 11s) [18:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:08] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable mobile homepage for cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518779 (https://phabricator.wikimedia.org/T225676) [18:04:57] twentyafterfour: yeah looks like ctor for dumpJson is not updated [18:05:14] I wonder why phan does not catch this stuff... [18:05:20] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:05:27] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Ottomata) Also found https://office.wikimedia.org/wiki/Technology/Onboarding which may (or may not) be helpful. 
:) [18:05:34] uh oh [18:05:51] should be same patch as a52594b1000611e3616009a31ebb638a3d1a664f [18:05:56] I'll make a patch [18:06:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:06:38] (03PS6) 10CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [18:07:21] SMalyshev: thanks! [18:08:33] twentyafterfour: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/518780 [18:09:32] (03PS7) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [18:09:37] Urbanecm: go ahead with SWAT [18:09:47] thanks twentyafterfour, going to do the needful :) [18:10:31] (03PS3) 10Urbanecm: Fix wgMetaNamespaceTalk for aswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518014 (https://phabricator.wikimedia.org/T226027) [18:10:36] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518014 (https://phabricator.wikimedia.org/T226027) (owner: 10Urbanecm) [18:10:39] [FYI note - Maps] At English Wikivoyage some users are reporting intermittent display problems with maps. 
https://en.wikivoyage.org/wiki/Wikivoyage:Travellers%27_pub#Who_can_fix_about_wikimedia_map_trouble (two sections in a row) [18:11:13] (03PS4) 10Reedy: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 [18:11:39] (03Merged) 10jenkins-bot: Fix wgMetaNamespaceTalk for aswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518014 (https://phabricator.wikimedia.org/T226027) (owner: 10Urbanecm) [18:12:10] (03CR) 10jenkins-bot: Fix wgMetaNamespaceTalk for aswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518014 (https://phabricator.wikimedia.org/T226027) (owner: 10Urbanecm) [18:12:16] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518171 (https://phabricator.wikimedia.org/T226217) (owner: 10DannyS712) [18:12:21] (03PS4) 10Urbanecm: Add "mass-upload" to autopatrollers and patrollers on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518171 (https://phabricator.wikimedia.org/T226217) (owner: 10DannyS712) [18:12:28] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518171 (https://phabricator.wikimedia.org/T226217) (owner: 10DannyS712) [18:13:46] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:518014|Fix wgMetaNamespaceTalk for aswikisource]] (T226027) (duration: 00m 55s) [18:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:56] T226027: Wrong wgMetaNamespaceTalk for aswikisource - https://phabricator.wikimedia.org/T226027 [18:14:52] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:15:00] there's the recovery I was waiting for [18:15:26] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less 
than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:15:28] 10Operations, 10Community-Relations, 10Performance-Team, 10Traffic, 10Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Krinkle) [18:15:51] (03Merged) 10jenkins-bot: Add "mass-upload" to autopatrollers and patrollers on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518171 (https://phabricator.wikimedia.org/T226217) (owner: 10DannyS712) [18:17:35] (03PS2) 10Urbanecm: Add hualab.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518278 (https://phabricator.wikimedia.org/T225917) [18:17:43] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518278 (https://phabricator.wikimedia.org/T225917) (owner: 10Urbanecm) [18:18:05] 10Operations, 10Community-Relations, 10Performance-Team, 10Traffic, 10Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Krinkle) @Community-Relations Just a heads-up in case you've heard anything ar... 
[18:19:19] (03Merged) 10jenkins-bot: Add hualab.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518278 (https://phabricator.wikimedia.org/T225917) (owner: 10Urbanecm) [18:19:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:518171|Add "mass-upload" to autopatrollers and patrollers on commons]] (T226217) (duration: 00m 55s) [18:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:26] T226217: Assign mass-upload flag to autopatrolled users on Commons - https://phabricator.wikimedia.org/T226217 [18:20:07] (03PS1) 10Ottomata: Disable CirrusSearchRequestSet avro monolog channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518784 (https://phabricator.wikimedia.org/T222268) [18:20:29] (03CR) 10Ottomata: "Good point, doing in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/518784" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [18:21:15] Urbanecm: when you are done, I would like to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/518784 [18:21:22] lemme know when :) [18:21:42] ottomata, sure, deploying the last one now [18:21:47] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:518278|Add hualab.nl to $wgCopyUploadsDomains]] (T225917) (duration: 00m 55s) [18:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:52] T225917: Please add hualab.nl to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T225917 [18:21:56] ottomata, SWAT is yours [18:23:04] thanks! 
[18:23:30] (03CR) 10Ottomata: [C: 03+2] Disable CirrusSearchRequestSet avro monolog channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518784 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [18:23:32] yw [18:23:50] (03PS2) 10Ottomata: Disable CirrusSearchRequestSet avro monolog channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518784 (https://phabricator.wikimedia.org/T222268) [18:25:22] (03PS2) 10Jbond: dsa-check-hpssacli: import latest version from DSA [puppet] - 10https://gerrit.wikimedia.org/r/516724 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [18:25:44] (03PS3) 10Jbond: dsa-check-hpssacli: refactor for speed/efficiency [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [18:26:05] (03PS3) 10Jbond: dsa-check-hpssacli: make compatible with ssacli [puppet] - 10https://gerrit.wikimedia.org/r/516726 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [18:26:24] (03PS4) 10Jbond: dsa-check-hpssacli: refactor for speed/efficiency [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [18:26:37] (03PS4) 10Jbond: dsa-check-hpssacli: make compatible with ssacli [puppet] - 10https://gerrit.wikimedia.org/r/516726 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [18:27:51] !log otto@deploy1001 Scap failed!: Call to mwscript eval.php stderr: not empty [18:27:54] (03PS1) 10CRusnov: Add a passthrough configuration system [software/netbox] - 10https://gerrit.wikimedia.org/r/518785 [18:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:15] (03CR) 10jenkins-bot: Add "mass-upload" to autopatrollers and patrollers on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518171 (https://phabricator.wikimedia.org/T226217) (owner: 10DannyS712) [18:28:50] ottomata: Think you just broke beta 
[18:28:51] 19:28:15 18:28:15 scap failed: RuntimeError Scap failed!: Call to mwscript eval.php stderr: Notice: Undefined variable: wmgMonologAvroSchemas in /srv/mediawiki-staging/wmf-config/logging.php on line 223 (duration: 00m 00s) [18:28:54] yup [18:28:54] fixing. [18:29:37] (03PS1) 10Ottomata: Use empty array for default when removing all wmgMonologAvroSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518786 (https://phabricator.wikimedia.org/T222268) [18:29:49] (03PS2) 10Ottomata: Use empty array for default when removing all wmgMonologAvroSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518786 (https://phabricator.wikimedia.org/T222268) [18:31:08] (03CR) 10Ottomata: [C: 03+2] Use empty array for default when removing all wmgMonologAvroSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518786 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [18:31:15] (03CR) 10jenkins-bot: Add hualab.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518278 (https://phabricator.wikimedia.org/T225917) (owner: 10Urbanecm) [18:31:30] Reedy: how did you see that in beta? 
[18:31:33] want to check there [18:31:45] (03CR) 10jenkins-bot: Disable CirrusSearchRequestSet avro monolog channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518784 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [18:35:34] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable CirrusSearchRequestSet avro monolog channel - T222268 (duration: 00m 55s) [18:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:40] T222268: Port usage of mediawiki_CirrusSearchRequestSet to mediawiki_cirrussearch_request - https://phabricator.wikimedia.org/T222268 [18:35:49] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) This is from our team but also probably of help: https://wikitech.wikimedia.org/wiki/Analytics/Team/Onboarding#Getting_permits [18:36:42] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) It is probably agood idea to create onboarding docs for team now @Halfak as there is probably couple more hires in the near term [18:36:58] (03CR) 10jenkins-bot: Use empty array for default when removing all wmgMonologAvroSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518786 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [18:37:38] ottomata: Any more for you? 
[18:42:59] (03PS3) 10Reedy: Move some FR config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518773 [18:43:04] (03CR) 10Reedy: [C: 03+2] Move some FR config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518773 (owner: 10Reedy) [18:44:05] (03Merged) 10jenkins-bot: Move some FR config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518773 (owner: 10Reedy) [18:44:21] (03CR) 10jenkins-bot: Move some FR config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518773 (owner: 10Reedy) [18:44:45] (03PS5) 10Reedy: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 [18:44:50] (03PS6) 10Reedy: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 [18:46:04] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Move some basic FR config into IS (duration: 00m 55s) [18:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:17] (03CR) 10CRusnov: "I have tested this on af-netbox with the swift settings sitting in PASSTHROUGH and it works as expected." 
[software/netbox] - 10https://gerrit.wikimedia.org/r/518785 (owner: 10CRusnov) [18:48:36] (03PS2) 10CRusnov: Add a passthrough configuration system [software/netbox] - 10https://gerrit.wikimedia.org/r/518785 (https://phabricator.wikimedia.org/T209182) [18:48:38] (03CR) 10Reedy: [C: 03+2] Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 (owner: 10Reedy) [18:48:41] !log rebooting cloudvirt1018 [18:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:28] (03Merged) 10jenkins-bot: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 (owner: 10Reedy) [18:49:43] (03CR) 10jenkins-bot: Remove config now in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518774 (owner: 10Reedy) [18:50:46] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Remove some now redundant config (duration: 00m 55s) [18:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:41] (03PS1) 10Ottomata: Remove usages of monolog kafka handler and avro formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518787 (https://phabricator.wikimedia.org/T226436) [18:53:43] Reedy: ? [18:53:47] more changes? yes... 
[18:53:51] heh [18:53:54] I'm done again, so you're good :P [18:53:56] but i can do them outside of swat, and make sure they work in beta first [18:53:58] ok thanks [18:54:27] (03CR) 10jerkins-bot: [V: 04-1] Remove usages of monolog kafka handler and avro formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518787 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [18:55:30] (03PS1) 10Reedy: Stop setting wgFlaggedRevsAutoReview to a boolean [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518788 [18:55:40] * Reedy carries on then :P [18:55:44] (03CR) 10Reedy: [C: 03+2] Stop setting wgFlaggedRevsAutoReview to a boolean [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518788 (owner: 10Reedy) [18:56:39] (03Merged) 10jenkins-bot: Stop setting wgFlaggedRevsAutoReview to a boolean [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518788 (owner: 10Reedy) [18:56:56] (03CR) 10jenkins-bot: Stop setting wgFlaggedRevsAutoReview to a boolean [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518788 (owner: 10Reedy) [18:58:10] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgFlaggedRevsAutoReview to a boolean (duration: 00m 55s) [18:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] thcipriani and paladox: Time to snap out of that daydream and deploy Gerrit Upgrade. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T1900). [19:00:34] * paladox here [19:00:36] (03PS2) 10Ottomata: Remove usages of monolog kafka handler and avro formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518787 (https://phabricator.wikimedia.org/T226436) [19:00:39] * thcipriani snaps out of it [19:00:52] Reedy: ottomata you all still deploying things for SWAT? 
[19:00:59] not for SWAT [19:01:02] :) [19:01:04] I'm just cleaning crap up [19:01:09] but I have some config changes coming trying to be careful [19:01:18] Nothing particularly urgent, just trying to prevent them stacking up too much [19:01:35] i'm trying to remove deprecated and now unused configs [19:01:42] just trying to do it in the right order to not cause logspam [19:02:15] sure, would one of you ping me when you're clear? Got a minor gerrit upgrade to get out. [19:02:30] mine will be I think 3 patches [19:02:38] going to do the first one now, ok Reedy thcipriani ? [19:02:48] (03PS3) 10Ottomata: Remove usages of monolog kafka handler and avro formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518787 (https://phabricator.wikimedia.org/T226436) [19:02:51] sure [19:03:14] yep [19:03:41] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Halfak) Yes. This is the ticket for requesting shell access. I believe it is tagged as described at https://wikitech.wikimedia.org/wiki/Production_shell_acce... 
[19:04:15] (03CR) 10Ottomata: [C: 03+2] Remove usages of monolog kafka handler and avro formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518787 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [19:05:16] that's wikidata entity dumps json, indeed, Reedy [19:05:27] if we see those for every entity it's a problem [19:06:06] if there's a few badly behaved entities somehow or other then we should (wikidata folks should) see what's up with those specific ones sometime [19:06:31] (03CR) 10jenkins-bot: Remove usages of monolog kafka handler and avro formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518787 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [19:06:47] 10Operations: Add Eric to cpt-leads@wikimedia.org and remove Marko to cpt-leads@wikimedia.org - https://phabricator.wikimedia.org/T226443 (10kchapman) [19:10:51] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Ottomata) I think it is done ok, there was some confusion I think the description caused some confusion with SRE the later comment about analytics servers. @j... [19:11:25] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) p:05Triage→03Normal [19:11:34] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) [19:11:57] (03PS1) 10RobH: setting ganeti400[123] mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/518792 (https://phabricator.wikimedia.org/T226444) [19:12:17] Reedy: can I scap sync-file multiple files at once? [19:12:28] you can sync-file the parent dir [19:12:33] all of wmf-config? [19:12:55] (03CR) 10RobH: [C: 03+2] setting ganeti400[123] mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/518792 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [19:12:59] oh and the tests/ dir i guess. 
i guess i can do 2 scap sync-files [19:13:05] Yeah [19:13:07] ok [19:13:43] !log otto@deploy1001 sync-file aborted: Remove usages of monolog kafka handler and avro formatter - tests - T226436 (duration: 00m 06s) [19:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:51] T226436: Remove Avro + Kafka support from Mediawiki Monolog configs - https://phabricator.wikimedia.org/T226436 [19:14:41] (03PS1) 10Reedy: Stop using $wgFlaggedRevsAutoReviewNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518794 [19:14:45] !log otto@deploy1001 Synchronized tests/loggingTest.php: Remove usages of monolog kafka handler and avro formatter - tests - T226436 (duration: 00m 55s) [19:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:36] (03Abandoned) 10Ottomata: Remove Monolog Kafka handler and configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517874 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [19:16:02] !log otto@deploy1001 Synchronized wmf-config: Remove usages of monolog kafka handler and avro formatter - wmf-config - T226436 (duration: 00m 56s) [19:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:54] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10CDanis) Am I alone in feeling like this probably deserves an [[ https://wikitech.wikimedia.org/wiki/Incident_documentation | incident report ]]? I'd b... 
[19:17:57] (03PS1) 10Ottomata: Remove remaining monolog kafka and avro related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518797 (https://phabricator.wikimedia.org/T226436) [19:18:43] (03CR) 10jerkins-bot: [V: 04-1] Remove remaining monolog kafka and avro related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518797 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [19:19:22] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10RhinosF1) >>! In T226109#5279992, @CDanis wrote: > Am I alone in feeling like this probably deserves an [[ https://wikitech.wikimedia.org/wiki/Incident... [19:22:45] (03PS2) 10Ottomata: Remove remaining monolog kafka and avro related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518797 (https://phabricator.wikimedia.org/T226436) [19:24:40] (03CR) 10Eevans: Cassandra nodetool repair cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: 10Mathew.onipe) [19:26:08] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) >>! 
In T226109#5279998, @RhinosF1 wrote: > I think there was one at some point Pretty sure there wasn't [19:26:34] (03CR) 10Ottomata: [C: 03+2] Remove remaining monolog kafka and avro related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518797 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [19:26:52] (03CR) 10jenkins-bot: Remove remaining monolog kafka and avro related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518797 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [19:30:37] !log otto@deploy1001 Synchronized tests/TestServices.php: Remove remaining monolog kafka and avro related configs - tests - T226436 (duration: 00m 56s) [19:30:38] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10RhinosF1) >>! In T226109#5280026, @Reedy wrote: >>>! In T226109#5279998, @RhinosF1 wrote: >> I think there was one at some point > > Pretty sure there... [19:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:43] T226436: Remove Avro + Kafka support from Mediawiki Monolog configs - https://phabricator.wikimedia.org/T226436 [19:31:12] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) >>! In T226109#5280036, @RhinosF1 wrote: >>>! In T226109#5280026, @Reedy wrote: >>>>! In T226109#5279998, @RhinosF1 wrote: >>> I think there was... 
[19:31:26] (03CR) 10Elukey: [C: 03+2] Add hiera overrides for analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/518767 (owner: 10Elukey) [19:31:34] (03PS4) 10Elukey: Add hiera overrides for analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/518767 [19:31:36] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add hiera overrides for analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/518767 (owner: 10Elukey) [19:31:58] !log otto@deploy1001 Synchronized wmf-config: Remove remaining monolog kafka and avro related configs - wmf-config - T226436 (duration: 00m 55s) [19:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:38] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10RhinosF1) >>! In T226109#5280038, @Reedy wrote: >>>! In T226109#5280036, @RhinosF1 wrote: >>>>! In T226109#5280026, @Reedy wrote: >>>>>! In T226109#527... [19:32:59] !log restart yarn/hdfs on analytics1072 to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518767/ (broken disk) [19:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:21] (03PS1) 10Reedy: Migrate hewikisource $wgFlaggedRevsTags config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518801 (https://phabricator.wikimedia.org/T226439) [19:33:51] (03PS2) 10Reedy: Stop using $wgFlaggedRevsAutoReviewNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518794 [19:33:53] (03PS2) 10Reedy: Migrate hewikisource $wgFlaggedRevsTags config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518801 (https://phabricator.wikimedia.org/T226439) [19:34:43] (03PS2) 10Ottomata: Remove the event-schemas submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T226436) [19:35:01] Hm, hey Reedy [19:35:07] ja? [19:35:11] do you think there is anything special needed for ^^? 
[19:35:18] removing a submodule from mediawiki-config [19:35:25] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10JHedden) There's a lot of good information in this task. I'm still catching up, but I wanted to note that it's important to consider the repl... [19:35:44] Shouldn't do [19:36:00] modern git handles it better than older versions [19:36:08] ok [19:36:10] RECOVERY - puppet last run on analytics1072 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:36:25] i don't need to sync-file .gitmodules, right? [19:36:25] just the wmf-config/event-schemas? [19:36:43] you don't need to, but you can for completeness [19:36:46] ok [19:36:49] You'll need to sync-dir wmf-config [19:36:51] i guess i'll sync-file wmf-config/ [19:36:51] right. [19:36:52] ok [19:36:59] So it propagates the delete [19:36:59] (03CR) 10Ottomata: [C: 03+2] Remove the event-schemas submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [19:37:20] (03CR) 10jenkins-bot: Remove the event-schemas submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [19:37:23] (03CR) 10Krinkle: [C: 03+1] Migrate hewikisource $wgFlaggedRevsTags config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518801 (https://phabricator.wikimedia.org/T226439) (owner: 10Reedy) [19:41:16] !log otto@deploy1001 Synchronized wmf-config: Remove the event-schemas submodule - wmf-config - T226436 (duration: 00m 55s) [19:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:22] T226436: Remove Avro + Kafka support from Mediawiki Monolog configs - https://phabricator.wikimedia.org/T226436 [19:41:22] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install 
(3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Bstorm) I should point out that the PoC will not be capable of doing anywhere near that much IO. That would be what it would look like if we... [19:43:06] ottomata: how go the deploys? Am I clear for the gerrit upgrade? [19:43:09] !log otto@deploy1001 Synchronized .gitmodules: Remove the event-schemas submodule - .gitmodules - T226436 (duration: 00m 55s) [19:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:21] heh, oh :) [19:43:57] thcipriani: you're spoiling his fun! [19:44:09] it's what I do best [19:44:14] thcipriani: i am done! [19:44:19] just now [19:44:20] go ahead! [19:44:27] cool, thank you! [19:44:31] (03CR) 10Thcipriani: [V: 03+2] Gerrit v2.15.14 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/518039 (owner: 10Thcipriani) [19:45:13] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Halfak) [19:45:19] heh [19:45:23] yeah, turn gerrit off [19:45:27] then we have to stop :P [19:45:58] (03PS1) 10CDanis: dbctl: remove never-implemented 'section get --mediawiki' flag [software/conftool] - 10https://gerrit.wikimedia.org/r/518802 [19:46:12] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@e3695fd]: Gerrit to 2.15.14 (gerrit2001 only) [19:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:24] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@e3695fd]: Gerrit to 2.15.14 (gerrit2001 only) (duration: 00m 11s) [19:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:42] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@e3695fd]: Gerrit to 2.15.14 on cobalt (restart incoming) [19:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:54] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@e3695fd]: 
Gerrit to 2.15.14 on cobalt (restart incoming) (duration: 00m 12s) [19:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:22] !log restarting gerrit for 2.15.14 update [19:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:21] awww for a brief moment I thought maybe the 2.16 upgrade had been scheduled, approved, and was happening Right Now :-D [19:50:42] !log gerrit back [19:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:12] apergos: heh, Soon™ [19:51:27] (03PS2) 10Reedy: Remove comments about how to "install" FR, very out of date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518778 [19:51:33] (03CR) 10Reedy: [C: 03+2] Remove comments about how to "install" FR, very out of date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518778 (owner: 10Reedy) [19:51:50] I'll take soon. :-) [19:52:30] (03Merged) 10jenkins-bot: Remove comments about how to "install" FR, very out of date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518778 (owner: 10Reedy) [19:52:44] PROBLEM - puppet last run on schema2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:52:44] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Bstorm) Ok, that said, I did write that misreading Mbps for Gbps...but what I said is still true! The PoC won't be anywhere near all that, a... 
[19:52:46] (03CR) 10jenkins-bot: Remove comments about how to "install" FR, very out of date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518778 (owner: 10Reedy) [19:52:54] (03PS3) 10Reedy: Stop using $wgFlaggedRevsAutoReviewNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518794 [19:52:58] (03CR) 10Reedy: [C: 03+2] Stop using $wgFlaggedRevsAutoReviewNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518794 (owner: 10Reedy) [19:53:00] PROBLEM - puppet last run on schema1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:53:44] > One or more new SSH keys have been added to Gerrit Code Review at gerrit.wikimedia.org [19:53:56] (03Merged) 10jenkins-bot: Stop using $wgFlaggedRevsAutoReviewNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518794 (owner: 10Reedy) [19:54:02] new ssh key email alerts magic working in gerrit \o/ [19:54:03] (03PS3) 10Reedy: Migrate hewikisource $wgFlaggedRevsTags config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518801 (https://phabricator.wikimedia.org/T226439) [19:54:07] (03CR) 10Reedy: [C: 03+2] Migrate hewikisource $wgFlaggedRevsTags config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518801 (https://phabricator.wikimedia.org/T226439) (owner: 10Reedy) [19:54:36] (03CR) 10jenkins-bot: Stop using $wgFlaggedRevsAutoReviewNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518794 (owner: 10Reedy) [19:55:11] (03Merged) 10jenkins-bot: Migrate hewikisource $wgFlaggedRevsTags config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518801 (https://phabricator.wikimedia.org/T226439) (owner: 10Reedy) [19:55:30] (03CR) 10jenkins-bot: Migrate hewikisource $wgFlaggedRevsTags config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518801 (https://phabricator.wikimedia.org/T226439) (owner: 10Reedy) [19:55:31] thcipriani that always worked :) [19:55:46] > One or more SSH 
keys have been deleted on Gerrit Code Review at gerrit.wikimedia.org [19:55:49] also working :) [19:55:50] it was only admins could bypass that which is now fixed [19:56:45] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: rm old comments move more FR config (duration: 00m 52s) [19:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:55] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Update more fr config (duration: 00m 55s) [19:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] cscott, arlolra, subbu, bearND, and halfak: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T2000). [20:00:09] 10Operations, 10Community-Relations, 10Performance-Team, 10Traffic, 10Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Krinkle) [20:00:14] no parsoid deploy today [20:00:21] 10Operations, 10Community-Relations, 10Performance-Team, 10Traffic, 10Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Krinkle) [20:01:13] 10Operations, 10Community-Relations, 10Traffic, 10Performance, 10Performance-Team (Radar): Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Gilles) [20:01:32] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10Gilles) a:03Gilles [20:02:17] 10Operations, 10Community-Relations, 10Traffic, 10Performance, 10Performance-Team (Radar): Sometimes pages load slowly for European users (due to some 
factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Krinkle) (The task is titled "European users", but more precisely it a... [20:05:48] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) [20:05:58] (03PS1) 10Ottomata: Ensure mediawiki-analytics (avro) Camus job is absent [puppet] - 10https://gerrit.wikimedia.org/r/518807 (https://phabricator.wikimedia.org/T226436) [20:06:04] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) notes for my remote update of these systems later today ganeti4001 asw-22-ulsfo:xe-0/0/9 1052 prod 1253 mgmt ganeti4002 asw-23-ulsfo:xe-0/0/9 1050 prod 1254 mgmt ganeti4003 asw-22-ulsfo:xe-0/0/10... [20:06:40] (03CR) 10jerkins-bot: [V: 04-1] Ensure mediawiki-analytics (avro) Camus job is absent [puppet] - 10https://gerrit.wikimedia.org/r/518807 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [20:07:18] (03PS2) 10Ottomata: Ensure mediawiki-analytics (avro) Camus job is absent [puppet] - 10https://gerrit.wikimedia.org/r/518807 (https://phabricator.wikimedia.org/T226436) [20:08:03] (03CR) 10jerkins-bot: [V: 04-1] Ensure mediawiki-analytics (avro) Camus job is absent [puppet] - 10https://gerrit.wikimedia.org/r/518807 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [20:08:30] (03PS3) 10Ottomata: Ensure mediawiki-analytics (avro) Camus job is absent [puppet] - 10https://gerrit.wikimedia.org/r/518807 (https://phabricator.wikimedia.org/T226436) [20:11:02] (03CR) 10Ottomata: [C: 03+2] Ensure mediawiki-analytics (avro) Camus job is absent [puppet] - 10https://gerrit.wikimedia.org/r/518807 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [20:11:57] (03PS1) 10CDanis: dbctl: s/slave/replica/ everywhere [software/conftool] - 10https://gerrit.wikimedia.org/r/518809 [20:12:51] Hola! 
Can someone here give admin rights to user "Dom walden" on https://de.wikipedia.beta.wmflabs.org/? He's our QA person. [20:13:11] (03PS1) 10Ottomata: Remove mediawiki-analytics (avro) camus job definition [puppet] - 10https://gerrit.wikimedia.org/r/518810 (https://phabricator.wikimedia.org/T226436) [20:13:42] !log rebooting cloudvirt1024 [20:13:45] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Bstorm) So figuring, based on that data, that it may not be impossible to fill the link, it's extremely unlikely that we will (and we still w... [20:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:26] RECOVERY - puppet last run on schema2002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:19:44] RECOVERY - puppet last run on schema1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:21:48] (03PS1) 10Thcipriani: gerrit: Re-enable the use of HTTP auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/518811 (https://phabricator.wikimedia.org/T225308) [20:22:10] zomg [20:22:30] \o/ thcipriani [20:23:01] (03CR) 10Nuria: "Looks like we will be keeping this data in deep storage for 60 days. abandoning" [puppet] - 10https://gerrit.wikimedia.org/r/518769 (https://phabricator.wikimedia.org/T226227) (owner: 10Nuria) [20:23:08] (03Abandoned) 10Nuria: Keeping webrequest_sampled data for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/518769 (https://phabricator.wikimedia.org/T226227) (owner: 10Nuria) [20:24:35] thcipriani: yaaay :)) [20:25:04] thcipriani: thank you (and paladox) for all your work on Gerrit lately [20:25:15] you're welcome :) [20:25:19] (03PS2) 10Ottomata: Remove mediawiki-analytics (avro) camus job definition [puppet] - 10https://gerrit.wikimedia.org/r/518810 (https://phabricator.wikimedia.org/T226436) [20:25:43] legoktm: sure thing!
this one was definitely a lot of work from paladox: I'm just here pushing buttons :) [20:26:12] (03CR) 10Ottomata: [C: 03+2] Remove mediawiki-analytics (avro) camus job definition [puppet] - 10https://gerrit.wikimedia.org/r/518810 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [20:26:50] (03PS1) 10CDanis: dbctl: s/reason/ro_reason/ in the schema, so section edit is clearer [software/conftool] - 10https://gerrit.wikimedia.org/r/518812 [20:27:25] 10Operations, 10Annual-Report, 10serviceops, 10Patch-For-Review: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (10Varnent) All set - thank you! Apologies for the short notice. :) [20:27:31] (03CR) 10Paladox: [C: 03+1] "As the one who implemented the new notifications (with help from David Pursehouse upstream), i've confirmed this works (users are now noti" [puppet] - 10https://gerrit.wikimedia.org/r/518811 (https://phabricator.wikimedia.org/T225308) (owner: 10Thcipriani) [20:27:34] (03PS1) 10Ayounsi: Allow $NETWORK_INFRA to use syslog/kafka [puppet] - 10https://gerrit.wikimedia.org/r/518813 (https://phabricator.wikimedia.org/T224128) [20:28:49] (03CR) 10Herron: [C: 03+1] Allow $NETWORK_INFRA to use syslog/kafka [puppet] - 10https://gerrit.wikimedia.org/r/518813 (https://phabricator.wikimedia.org/T224128) (owner: 10Ayounsi) [20:29:48] (03PS2) 10CDanis: gerrit: Re-enable the use of HTTP auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/518811 (https://phabricator.wikimedia.org/T225308) (owner: 10Thcipriani) [20:29:51] (03PS1) 10Ottomata: Remove monitoring and alerts for kafka analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/518814 (https://phabricator.wikimedia.org/T183303) [20:29:56] (03CR) 10CDanis: [C: 03+2] gerrit: Re-enable the use of HTTP auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/518811 (https://phabricator.wikimedia.org/T225308) (owner: 10Thcipriani) [20:30:40] (03CR) 10jerkins-bot: [V: 04-1] Remove monitoring and alerts for kafka analytics 
cluster [puppet] - 10https://gerrit.wikimedia.org/r/518814 (https://phabricator.wikimedia.org/T183303) (owner: 10Ottomata) [20:30:53] cdanis: thank you! [20:31:30] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17069/wezen.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/518813 (https://phabricator.wikimedia.org/T224128) (owner: 10Ayounsi) [20:31:40] (03PS2) 10Ayounsi: Allow $NETWORK_INFRA to use syslog/kafka [puppet] - 10https://gerrit.wikimedia.org/r/518813 (https://phabricator.wikimedia.org/T224128) [20:31:46] (03PS2) 10Ottomata: Remove monitoring and alerts for kafka analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/518814 (https://phabricator.wikimedia.org/T183303) [20:32:14] jouncebot: now [20:32:15] For the next 0 hour(s) and 27 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T2000) [20:34:06] yay! [20:36:23] (03CR) 10Ottomata: "Looks ok https://puppet-compiler.wmflabs.org/compiler1002/17071/" [puppet] - 10https://gerrit.wikimedia.org/r/518814 (https://phabricator.wikimedia.org/T183303) (owner: 10Ottomata) [20:40:34] ErrorException from line 5468 of /srv/mediawiki/wmf-config/InitialiseSettings.php: PHP Notice: Undefined index: kafka [20:40:40] 2019-06-24T19:32:07 [20:40:44] ottomata ^^ [20:40:46] that would be me! [20:40:47] ok... [20:40:58] was it solved meanwhile? [20:41:00] Krinkle: in beta? [20:41:02] oh [20:41:02] prod [20:41:20] oh more than an hour ago? [20:41:34] Aye, indeed. [20:41:35] i would have expected my deploy ordering to be ok with that. [20:41:39] any sense then? [20:41:42] 32 hits [20:41:46] various prod page views [20:41:51] until 2019-06-24T19:32:07 [20:42:22] hm, must have missed a step. it was a multi part no-op config deploy, but some configs referred to other things. 
Seemed to happen in the right order...but perhaps it didn't [20:42:27] anyway it should all be out now [20:42:30] so yeah, likely a syncing issue that no longer happens now but unsure what the impact was. [20:42:36] so if there haven't been any new errors in a while it should be fine [20:42:56] the impact should be nothing. i was removing unused config [20:42:58] ottomata: What would the impact be of this key being null (which is what PHP does in that case, it pretends the key exists as null) [20:43:02] ok [20:43:03] thx [20:43:33] yup, thanks for keeping an eye on it [20:43:39] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.06.24/mediawiki?id=AWuK-V_wQFnOyvY0gE9_&_g=h@44136fa [20:43:47] ottomata: interesting stack trace, might help to understand why it happened [20:43:51] regarding ordering [20:44:16] notice how it loads wmfLoadInitialiseSettings mid-request for cross-wiki SiteConfig reads [20:46:20] aye, but I had thought I had deployed the change that accessed the 'kafka' key before the change that removed the kafka key [20:46:42] (03PS1) 10Ayounsi: Rename facility_label to facility [puppet] - 10https://gerrit.wikimedia.org/r/518818 (https://phabricator.wikimedia.org/T224128) [20:47:06] (03CR) 10Herron: [C: 03+1] Rename facility_label to facility [puppet] - 10https://gerrit.wikimedia.org/r/518818 (https://phabricator.wikimedia.org/T224128) (owner: 10Ayounsi) [20:48:00] (03CR) 10Nuria: [C: 03+1] Remove remaining monolog kafka and avro related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518797 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [20:50:44] (03CR) 10Herron: [C: 03+2] Rename facility_label to facility [puppet] - 10https://gerrit.wikimedia.org/r/518818 (https://phabricator.wikimedia.org/T224128) (owner: 10Ayounsi) [20:56:21] 10Operations, 10Community-Relations, 10Traffic, 10Performance, and 2 others: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster)
- https://phabricator.wikimedia.org/T226048 (10Legoktm) [21:00:04] bawolff and Reedy: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T2100). [21:00:12] sbassett: ^ [21:00:16] We should add you to the window/template [21:00:52] Reedy: sure, that sounds good. Maybe even one other person from the secteam just in case. [21:01:12] Anyone can take it, it's just who it pings [21:01:37] Ok, that's fine [21:01:43] https://wikitech.wikimedia.org/w/index.php?title=Deployments%2FTemplate&type=revision&diff=1830291&oldid=1830020 [21:01:55] I can deploy https://gerrit.wikimedia.org/r/518350 in a bit, just working on another ps atm [21:01:56] The ping can be useful sometimes when you're busy and lose track of time [21:02:01] Yup :) [21:04:07] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a915f69]: Add /page/media-lint - T226105 - and various other cleanups [21:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:12] T226105: Deploy new media-list endpoint in RESTBase - https://phabricator.wikimedia.org/T226105 [21:08:56] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to deployment hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Halfak) [21:11:44] 10Operations, 10Annual-Report, 10serviceops, 10Patch-For-Review: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (10LTraer) @jijiki Thank you so much! 
[21:13:20] (03PS2) 10SBassett: Revert "Temporary make account creation limits more restrictive" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518350 (https://phabricator.wikimedia.org/T212667) (owner: 10JJMC89) [21:14:27] (03CR) 10SBassett: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518350 (https://phabricator.wikimedia.org/T212667) (owner: 10JJMC89) [21:14:52] (03CR) 10Jforrester: "❤️" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T226436) (owner: 10Ottomata) [21:16:32] (03CR) 10SBassett: [C: 03+2] Revert "Temporary make account creation limits more restrictive" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518350 (https://phabricator.wikimedia.org/T212667) (owner: 10JJMC89) [21:17:31] (03Merged) 10jenkins-bot: Revert "Temporary make account creation limits more restrictive" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518350 (https://phabricator.wikimedia.org/T212667) (owner: 10JJMC89) [21:18:05] Deploying now: https://gerrit.wikimedia.org/r/518350 [21:22:13] (03CR) 10jenkins-bot: Revert "Temporary make account creation limits more restrictive" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518350 (https://phabricator.wikimedia.org/T212667) (owner: 10JJMC89) [21:23:14] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a915f69]: Add /page/media-lint - T226105 - and various other cleanups (duration: 19m 08s) [21:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:22] T226105: Deploy new media-list endpoint in RESTBase - https://phabricator.wikimedia.org/T226105 [21:31:53] mobrovac: sbassett done deploying for a bit? I'd like to restart gerrit to pick up a config change. 
[21:32:14] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2619 MB (5% inode=53%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [21:32:16] yup thcipriani, all good here [21:32:21] thcipriani: Just scapping out file now [21:32:23] oh but wait thcipriani, could you wait 5 mins? [21:32:30] !log sbassett@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deployed r/518350 - Revert "Temporary make account creation limits more restrictive" (duration: 00m 56s) [21:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:38] Pchelolo wants to get out a change to unblock otto real quick [21:32:49] mobrovac: yep, no problem [21:32:54] thnx [21:33:13] thcipriani: I'm done now (unless fatalmonitor blows up, which it shouldn't) [21:33:25] sbassett: great, thank you :) [21:35:25] thcipriani: what's up with the jenkins bot telling us our repo has been archived? https://gerrit.wikimedia.org/r/#/c/mediawiki/services/change-propagation/deploy/+/518821/ wth? :) [21:36:30] * thcipriani looks [21:37:29] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@17e71b5]: Support .meta.stream as well as .meta.topic T226198 [21:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:36] T226198: Prepare change-prop to consume new-style messages - https://phabricator.wikimedia.org/T226198 [21:38:02] mobrovac it was done in https://phabricator.wikimedia.org/rCICF872b07e07f35542c605bf5b635c24db4ae3b8026 [21:38:42] James_F: ^ ? why? [21:38:46] cp is not archived [21:39:03] nor is graphoid for that matter [21:39:11] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@17e71b5]: Support .meta.stream as well as .meta.topic T226198 (duration: 01m 42s) [21:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:16] thnx paladox [21:39:58] thcipriani: I'm done. 
Thank you for waiting [21:40:13] Pchelolo: cool thanks for the ping [21:40:57] !log restart gerrit for https://gerrit.wikimedia.org/r/518811/ [21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:24] mobrovac: Oh, sorry. I thought you were using the pipeline in those repos now? [21:41:36] nah, not yet [21:41:48] James_F: you're being too optimistic :P [21:41:49] Ah, OK, will revert. [21:42:25] I was so happy to kill off the terrible -deploy jobs. [21:42:31] * James_F sighs into his coffee. [21:42:57] !log gerrit back [21:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:09] Wait, even mediawiki/services/change-propagation/deploy? It had no jobs… [21:44:05] [Gerrit Code Review] HTTP password was added or updated \o/ [21:45:13] mobrovac: Oh, I see, you were bypassing all of CI and just manually V+2ing? Tsk. OK, will restore with a comment. [21:45:47] And mediawiki/services/graphoid/deploy hasn't had a new commit in two years. Are you sure it's being used? [21:46:20] service-template-node v0.5.4. Lovely. [21:46:21] James_F: yes, we are trying to get it to k8s, but it's not there yet [21:46:30] OK, will restore that properly. [21:46:39] thank you! [21:46:52] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [21:49:40] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [21:50:18] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [22:11:58] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [22:14:06] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:16:52] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:19:54] (03PS1) 10RobH: adding install params for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/518831 (https://phabricator.wikimedia.org/T226444) [22:20:26] (03CR) 10jerkins-bot: [V: 04-1] adding install params for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/518831 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [22:22:32] (03PS2) 10RobH: adding install params for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/518831 (https://phabricator.wikimedia.org/T226444) [22:23:04] (03CR) 10jerkins-bot: [V: 04-1] adding install params for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/518831 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [22:23:41] bah, bad copy paste day. [22:23:41] (03PS3) 10RobH: adding install params for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/518831 (https://phabricator.wikimedia.org/T226444) [22:24:43] (03CR) 10RobH: [C: 03+2] adding install params for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/518831 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [22:26:44] 10Operations, 10ops-ulsfo, 10Patch-For-Review: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) a:05RobH→03akosiaris Actually, I'm not 100% sure on this. @akosiaris: Do these need public or private IP addresses for their base host network connections? 
[22:27:29] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (10mobrovac) [22:28:18] (03PS1) 10RobH: fix ganeti host entry [puppet] - 10https://gerrit.wikimedia.org/r/518832 (https://phabricator.wikimedia.org/T226444) [22:30:22] (03CR) 10RobH: [C: 03+2] fix ganeti host entry [puppet] - 10https://gerrit.wikimedia.org/r/518832 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [22:42:02] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:43:20] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 79930 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:43:29] (03PS1) 10RobH: setting ganeti400[123] production dns [dns] - 10https://gerrit.wikimedia.org/r/518834 (https://phabricator.wikimedia.org/T226444) [22:43:42] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) [22:43:55] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) a:05akosiaris→03RobH [22:44:13] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) I set these to internal IP/vlan since other ganeti hosts are that way. [22:46:08] 10Operations, 10Electron-PDFs, 10Core Platform Team Kanban (Done with CPT), 10Services (done): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 (10Pchelolo) 05Open→03Resolved a:03Pchelolo Electron is not used anymore, closing. 
[22:49:00] jouncebot: now [22:49:00] For the next 0 hour(s) and 10 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T2100) [22:53:42] !log krinkle@deploy1001: There is an untracked "wmf-config/event-schemas/" directory in the /srv/mediawiki deployment source, ref T226436 [22:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:47] T226436: Remove Avro + Kafka support for Monolog configs - https://phabricator.wikimedia.org/T226436 [22:56:55] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.10/extensions/ProofreadPage/includes/Special/SpecialProofreadPages.php: ed556868f / T225813 (duration: 00m 53s) [22:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:00] T225813: ErrorException from line 125 of ApiQueryQueryPage.php: PHP Notice: Undefined property: stdClass::$value - https://phabricator.wikimedia.org/T225813 [22:57:44] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:58:06] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190624T2300). [23:00:04] CFisch_WMDE: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:16] \o/ [23:01:36] * Krinkle is done with their deployment [23:04:47] Can anyone deploy my backport? 
:-) [23:05:17] * CFIsch_WMDE is way too inexperienced and tired to do that myself... -.- [23:07:45] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) So while I can ssh into any of the other site switch stacks, or into the mgmt on the servers in ulsfo, I cannot ssh into asw-ulsfo.mgmt.ulsfo.wmnet. @ayounsi: please ping me when you have a moment... [23:08:09] CFIsch_WMDE: If you ask nicely ;) [23:08:28] * CFIsch_WMDE asks nicely ;-) [23:09:46] pleeeeeease 🐐 [23:10:32] CFIsch_WMDE: Um, is it only on .8? [23:10:38] What about .10 and master? [23:11:04] oh, I see [23:11:06] WFM [23:11:09] It's a squashed patch cherry pick from several .10 patches [23:11:39] Adam W. accidentally merged it today and we did not know the exact process [23:11:51] so we reverted it and created it anew [23:12:02] hopefully without making things worse -.p [23:18:42] (03PS23) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:20:30] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [23:24:45] (03PS24) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:25:57] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [23:26:08] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2661 MB (5% inode=53%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [23:27:56] (03PS25) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:29:19] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [23:30:26] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [23:39:08] Reedy: It's merged :-) [23:42:06] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/AdvancedSearch/: (no justification provided) (duration: 00m 56s) [23:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:07] (03PS1) 10Reedy: Revert "Further partial revert of copy paste of config into code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518842 [23:47:03] (03PS2) 10Reedy: Revert "Further partial revert of copy paste of config into code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518842 [23:47:56] (03PS26) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:48:57] (03PS3) 10Reedy: Rework setup of FR autopromote config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518842 [23:50:28] (03PS27) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:51:18] !log twentyafterfour@deploy1001 Synchronized php-1.34.0-wmf.10/extensions/Wikibase/: Sync https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/518782/ refs T220735 (duration: 01m 21s) [23:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:23] T220735: 1.34.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T220735