[00:53:25] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:54:43] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:36:19] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:53:19] (PS1) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[01:53:46] (CR) jerkins-bot: [V: -1] WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013 (owner: CDanis)
[01:55:01] (PS2) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[01:55:50] (CR) jerkins-bot: [V: -1] WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013 (owner: CDanis)
[01:57:54] (PS3) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[01:58:40] (CR) jerkins-bot: [V: -1] WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013 (owner: CDanis)
[02:02:29] (PS4) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[02:03:20] (CR) jerkins-bot: [V: -1] WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013 (owner: CDanis)
[02:03:30] (PS5) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[02:03:35] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:10:18] (PS6) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[02:12:39] (PS7) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[02:43:06] (PS8) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[02:49:39] (PS9) CDanis: WIP dbctl [puppet] - https://gerrit.wikimedia.org/r/523013
[04:36:27] RECOVERY - snapshot of s6 in codfw on db1115 is OK: snapshot for s6 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-07-15 03:18:39 from db2097.codfw.wmnet:3316 (491 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:23:39] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:51:29] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[06:57:49] (PS2) Muehlenhoff: Remove sudo user for already removed Diamond collector [puppet] - https://gerrit.wikimedia.org/r/522449
[07:07:38] (CR) Muehlenhoff: [C: +2] Remove sudo user for already removed Diamond collector [puppet] - https://gerrit.wikimedia.org/r/522449 (owner: Muehlenhoff)
[07:08:03] (PS1) Elukey: role::swap: add profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/523090 (https://phabricator.wikimedia.org/T170826)
[07:08:27] (PS2) Muehlenhoff: Adapt netboot.cfg/DHCP to new names of LDAP replicas in codfw [puppet] - https://gerrit.wikimedia.org/r/522498 (https://phabricator.wikimedia.org/T227778)
[07:08:55] (CR) Elukey: [C: -1] "There are still Spark2 sessions with old "random" port scheme, will need to wait for all users :)" [puppet] - https://gerrit.wikimedia.org/r/523090 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[07:12:28] (CR) Volans: [C: +1] "LGTM, thanks for the fix!" [software/conftool] - https://gerrit.wikimedia.org/r/522235 (owner: CDanis)
[07:12:36] (CR) Muehlenhoff: [C: +2] Adapt netboot.cfg/DHCP to new names of LDAP replicas in codfw [puppet] - https://gerrit.wikimedia.org/r/522498 (https://phabricator.wikimedia.org/T227778) (owner: Muehlenhoff)
[07:16:50] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/522510 (owner: Jbond)
[07:27:17] (PS1) Muehlenhoff: Remove sudo users formerly used by Diamond collectors [puppet] - https://gerrit.wikimedia.org/r/523093
[07:43:25] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff)
[07:44:21] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff)
[07:44:57] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff) Cole fixed the remaining dashboards. Andrew, can you have a final look whether everything works as expected, then we can close the task?
[07:50:36] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17366/" [puppet] - https://gerrit.wikimedia.org/r/522406 (owner: Muehlenhoff)
[07:50:43] (PS2) Muehlenhoff: Remove jessie support from Kafka class [puppet] - https://gerrit.wikimedia.org/r/522406
[07:52:45] (CR) Muehlenhoff: [C: +2] Remove jessie support from Kafka class [puppet] - https://gerrit.wikimedia.org/r/522406 (owner: Muehlenhoff)
[07:59:45] (CR) Filippo Giunchedi: [C: +1] "Might require manual cleanup post merge" [puppet] - https://gerrit.wikimedia.org/r/523093 (owner: Muehlenhoff)
[08:01:06] !log upgrading acme-chief to version 0.19 in acme-chief production instances - T225945
[08:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:16] T225945: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945
[08:05:44] (PS2) Muehlenhoff: Remove sudo users formerly used by Diamond collectors [puppet] - https://gerrit.wikimedia.org/r/523093
[08:07:35] (CR) Muehlenhoff: [C: +2] Remove sudo users formerly used by Diamond collectors [puppet] - https://gerrit.wikimedia.org/r/523093 (owner: Muehlenhoff)
[08:08:15] (CR) Hashar: "recheck" [debs/python-git-archive-all] - https://gerrit.wikimedia.org/r/522428 (owner: Hashar)
[08:10:35] Operations, netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (elukey) p:Triage→High
[08:10:45] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[08:13:53] Operations, netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (elukey)
[08:20:12] (PS3) Muehlenhoff: Remove standard::diamond and fold into profile::wmcs::instance [puppet] - https://gerrit.wikimedia.org/r/522436 (https://phabricator.wikimedia.org/T212231)
[08:21:55] (PS2) Fsero: registry, swift: some images are not replicated. [puppet] - https://gerrit.wikimedia.org/r/521828 (https://phabricator.wikimedia.org/T227570)
[08:22:34] !log set oemhp_powerreg=os on ms-be10[16-39] - T225713
[08:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:40] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713
[08:24:19] (PS1) Fsero: registry: introducing read only mode for maintenances [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570)
[08:25:19] (CR) jerkins-bot: [V: -1] registry: introducing read only mode for maintenances [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[08:26:01] (CR) Filippo Giunchedi: "LGTM, see inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/521828 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[08:26:33] (PS2) Fsero: registry: introducing read only mode for maintenances [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570)
[08:30:43] (CR) Muehlenhoff: [C: +2] Remove standard::diamond and fold into profile::wmcs::instance [puppet] - https://gerrit.wikimedia.org/r/522436 (https://phabricator.wikimedia.org/T212231) (owner: Muehlenhoff)
[08:35:59] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:36:09] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:36:33] (PS3) Fsero: registry, swift: some images are not replicated. [puppet] - https://gerrit.wikimedia.org/r/521828 (https://phabricator.wikimedia.org/T227570)
[08:37:21] (PS4) Fsero: registry, swift: some images are not replicated. [puppet] - https://gerrit.wikimedia.org/r/521828 (https://phabricator.wikimedia.org/T227570)
[08:42:37] (CR) Fsero: "ty for the review! addressed comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/521828 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[08:43:33] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/521828 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[08:44:43] (CR) Fsero: [C: +2] registry, swift: some images are not replicated. [puppet] - https://gerrit.wikimedia.org/r/521828 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[08:46:49] Operations, netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (elukey) From my home ipv6 address (removed the first hops): ` [..] 6. AS6939 100ge9-2.core1.par2.he.net 0.0% 10 46.0 49.6 40.9 67.5 9.2 7. AS6939...
[08:48:50] !log set oemhp_powerreg=os + reboot for elastic1054 - T225713
[08:48:56] !log T227570 changing container_synchronization on docker_registry_codfw to //docker_registry/eqiad/AUTH_docker/docker_registry_codfw
[08:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:59] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713
[08:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:05] T227570: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570
[08:49:17] (CR) Ema: [C: +1] lvs: Add ncredir service to high-traffic1 [puppet] - https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[08:49:42] !log correction: set oemhp_powerreg=os + reboot for elastic1052 (NOT elastic1054) - T225713
[08:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:19] !log creating docker_registry_codfw on eqiad T227570
[08:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:00] Operations, User-Elukey: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 (elukey) Opened https://github.com/bmatheny/memkeys/issues/25
[08:52:43] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/service-checker] (jessie) - https://gerrit.wikimedia.org/r/522366 (owner: Hashar)
[08:52:46] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/service-checker] (stretch) - https://gerrit.wikimedia.org/r/522367 (owner: Hashar)
[08:54:09] PROBLEM - Docker registry HTTPS interface on registry1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1001.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Docker
[08:54:18] ACKNOWLEDGEMENT - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. Gehel reboot of 1052 in progress https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[08:55:12] registry is known
[08:55:14] will ack
[08:56:58] ACKNOWLEDGEMENT - Docker registry HTTPS interface on registry1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1001.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.072 second response time Fsero T227570 https://wikitech.wikimedia.org/wiki/Docker
[08:57:06] (PS1) Filippo Giunchedi: syslog: add temp rsync to copy data [puppet] - https://gerrit.wikimedia.org/r/523102 (https://phabricator.wikimedia.org/T200706)
[08:59:04] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/17367/" [puppet] - https://gerrit.wikimedia.org/r/523102 (https://phabricator.wikimedia.org/T200706) (owner: Filippo Giunchedi)
[09:00:41] (CR) Volans: Introduce a ldap config in hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[09:02:47] (PS2) Filippo Giunchedi: syslog: add temp rsync to copy data [puppet] - https://gerrit.wikimedia.org/r/523102 (https://phabricator.wikimedia.org/T200706)
[09:03:13] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[09:03:25] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:09:53] (PS3) Elukey: Introduce a ldap config in hiera [puppet] - https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611)
[09:11:07] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (akosiaris) > At the moment, engineering resources at the Foundation are committed to other project work,...
[09:11:53] (PS1) Vgutierrez: Split langlist helper in two [dns] - https://gerrit.wikimedia.org/r/523106 (https://phabricator.wikimedia.org/T133548)
[09:12:11] (CR) jerkins-bot: [V: -1] Split langlist helper in two [dns] - https://gerrit.wikimedia.org/r/523106 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[09:12:16] wonderful :)
[09:14:01] (CR) Alexandros Kosiaris: [C: +1] Introduce a ldap config in hiera [puppet] - https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[09:14:50] (PS4) Elukey: Introduce a ldap config in hiera [puppet] - https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611)
[09:15:26] (CR) Elukey: [C: +2] Introduce a ldap config in hiera [puppet] - https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[09:16:43] (PS2) Vgutierrez: Split langlist helper in two [dns] - https://gerrit.wikimedia.org/r/523106 (https://phabricator.wikimedia.org/T133548)
[09:17:07] (CR) jerkins-bot: [V: -1] Split langlist helper in two [dns] - https://gerrit.wikimedia.org/r/523106 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[09:19:37] (Restored) Alexandros Kosiaris: Jenkins job validation (DO NOT SUBMIT) [software/service-checker] (stretch) - https://gerrit.wikimedia.org/r/522367 (owner: Hashar)
[09:22:12] (PS3) Vgutierrez: Split langlist helper in two [dns] - https://gerrit.wikimedia.org/r/523106 (https://phabricator.wikimedia.org/T133548)
[09:22:17] (CR) Alexandros Kosiaris: "recheck" [software/service-checker] (stretch) - https://gerrit.wikimedia.org/r/522367 (owner: Hashar)
[09:22:49] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:22:58] akosiaris: :]
[09:23:59] hashar: still fails, but for other reasons now
[09:24:06] https://integration.wikimedia.org/ci/job/debian-glue-non-voting/2553/console
[09:24:09] good!
[09:24:26] can't find the unit tests
[09:24:34] I also had a dummy change targeting the jessie branch
[09:25:05] oh dh pybuild manages to find and run the python tests that is great
[09:26:40] (CR) Alexandros Kosiaris: [C: +1] "LGTM. Nice!" [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[09:27:05] (PS1) Elukey: profile::hue|swap: use new ldap hiera configuration [puppet] - https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611)
[09:27:52] (CR) jerkins-bot: [V: -1] profile::hue|swap: use new ldap hiera configuration [puppet] - https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[09:28:47] (CR) Jbond: [C: +2] Edit urbanecm's .profile [puppet] - https://gerrit.wikimedia.org/r/522595 (owner: Urbanecm)
[09:29:16] (PS3) Jbond: Edit urbanecm's .profile [puppet] - https://gerrit.wikimedia.org/r/522595 (owner: Urbanecm)
[09:31:42] (PS1) Muehlenhoff: Remove now obsolete Diamond removal Hiera entries [puppet] - https://gerrit.wikimedia.org/r/523113
[09:31:44] (PS22) Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931)
[09:31:51] (PS29) Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772
[09:33:09] (PS1) Vgutierrez: Add a ncredir-parking zone [dns] - https://gerrit.wikimedia.org/r/523114 (https://phabricator.wikimedia.org/T133548)
[09:33:12] (PS1) Vgutierrez: Switch wikipedia.com to the ncredir-parking DNS zonefile [dns] - https://gerrit.wikimedia.org/r/523115 (https://phabricator.wikimedia.org/T133548)
[09:33:36] (PS2) Jbond: ntp: move the include of standard::ntp out of role and into profile [puppet] - https://gerrit.wikimedia.org/r/522510
[09:34:17] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.53 ms
[09:34:40] (CR) Jbond: [C: +2] ntp: move the include of standard::ntp out of role and into profile [puppet] - https://gerrit.wikimedia.org/r/522510 (owner: Jbond)
[09:38:01] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/523113 (owner: Muehlenhoff)
[09:39:42] !log repooling ms-fe2005 T227570
[09:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:48] T227570: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570
[09:42:45] (PS2) Elukey: profile::hue|swap: use new ldap hiera configuration [puppet] - https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611)
[09:45:36] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17370/" [puppet] - https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[09:46:43] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17369/" [puppet] - https://gerrit.wikimedia.org/r/523113 (owner: Muehlenhoff)
[09:46:50] (PS2) Muehlenhoff: Remove now obsolete Diamond removal Hiera entries [puppet] - https://gerrit.wikimedia.org/r/523113
[09:48:49] (CR) Muehlenhoff: [C: +2] Remove now obsolete Diamond removal Hiera entries [puppet] - https://gerrit.wikimedia.org/r/523113 (owner: Muehlenhoff)
[09:53:48] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/522548 (https://phabricator.wikimedia.org/T216040) (owner: Jhedden)
[09:54:57] (CR) Alexandros Kosiaris: [C: +1] "Heh, nice catch. Looks like e62fbba10a fixed the called parties mentioned in that comment so no harm done back then, but the docs had not " [puppet] - https://gerrit.wikimedia.org/r/522992 (https://phabricator.wikimedia.org/T113783) (owner: CDanis)
[09:55:03] (PS1) Muehlenhoff: Remove Diamond from standard/wdqs fixtures [puppet] - https://gerrit.wikimedia.org/r/523117
[09:56:55] !log cp-eqsin: varnish frontend rolling restarts for 5.1.3-1wm11 upgrades T227672
[09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:01] T227672: Upgrade Varnish to 5.1.3-1wm11 - https://phabricator.wikimedia.org/T227672
[09:57:51] Operations, observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Diamond is now gone from production.
[09:58:07] (CR) Filippo Giunchedi: "See inline, also adding traffic folks" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[09:58:55] (CR) Filippo Giunchedi: [C: +1] wdqs: update response time check to new prometheus metrics. [puppet] - https://gerrit.wikimedia.org/r/522499 (owner: Gehel)
[10:00:59] (CR) Filippo Giunchedi: [C: +1] Remove Diamond from standard/wdqs fixtures [puppet] - https://gerrit.wikimedia.org/r/523117 (owner: Muehlenhoff)
[10:01:56] (PS1) Hashar: Copy tests fixtures when building the package [software/service-checker] - https://gerrit.wikimedia.org/r/523119
[10:06:37] (PS2) Hashar: Copy tests fixtures when building the package [software/service-checker] - https://gerrit.wikimedia.org/r/523119
[10:06:38] (PS1) Hashar: debian: extend dpkg-source diff ignore [software/service-checker] - https://gerrit.wikimedia.org/r/523121
[10:07:03] (CR) Alexandros Kosiaris: [C: +2] Rake: honor rubocop AllCops/Excludes [puppet] - https://gerrit.wikimedia.org/r/484410 (owner: Hashar)
[10:07:12] (PS8) Alexandros Kosiaris: Rake: honor rubocop AllCops/Excludes [puppet] - https://gerrit.wikimedia.org/r/484410 (owner: Hashar)
[10:08:04] akosiaris: if you are in the mood for more puppet merges, I have a few other patches pending in puppet :]
[10:09:03] (CR) Elukey: prometheus: wire up prometheus-varnishkafka-exporter for deploy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[10:09:50] (PS3) Alexandros Kosiaris: contint: remove unused contint::packages::python [puppet] - https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[10:10:07] (CR) Alexandros Kosiaris: [V: +2 C: +2] contint: remove unused contint::packages::python [puppet] - https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[10:10:29] (PS4) Alexandros Kosiaris: contint: remove several unused packages [puppet] - https://gerrit.wikimedia.org/r/517093 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[10:10:38] (CR) Alexandros Kosiaris: [V: +2 C: +2] contint: remove several unused packages [puppet] - https://gerrit.wikimedia.org/r/517093 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[10:11:15] (PS4) Alexandros Kosiaris: contint: remove unneeded profile::ci::hhvm [puppet] - https://gerrit.wikimedia.org/r/517094 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[10:11:19] (CR) Alexandros Kosiaris: [V: +2 C: +2] contint: remove unneeded profile::ci::hhvm [puppet] - https://gerrit.wikimedia.org/r/517094 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[10:12:11] hashar: with the exception of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480957/ which has a -1 currently I think everything else you sent has been merged
[10:13:52] (CR) Hashar: "git buildpackage creates the source tarball from upstream/0.1.5 , but the master branch has since touched servicechecker/swagger.py and dp" [software/service-checker] - https://gerrit.wikimedia.org/r/523119 (owner: Hashar)
[10:13:54] (CR) Alexandros Kosiaris: "Thanks!" [dns] - https://gerrit.wikimedia.org/r/522383 (https://phabricator.wikimedia.org/T227778) (owner: Muehlenhoff)
[10:16:27] akosiaris: yeah there are a few that are not ready yet or need some more caution.
I could use a merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518222/ though :]
[10:17:03] (CR) Alexandros Kosiaris: [C: +2] contint: remove zuul-cloner from Docker agent [puppet] - https://gerrit.wikimedia.org/r/518222 (https://phabricator.wikimedia.org/T226233) (owner: Hashar)
[10:17:11] (PS2) Alexandros Kosiaris: contint: remove zuul-cloner from Docker agent [puppet] - https://gerrit.wikimedia.org/r/518222 (https://phabricator.wikimedia.org/T226233) (owner: Hashar)
[10:17:28] and hopefully one day puppet.git would have been cleaned up from most of that legacy ci/contint stuff :]
[10:17:38] :)
[10:18:58] (PS1) Tarrow: Bump Termbox Staging to 2019-07-12 [deployment-charts] - https://gerrit.wikimedia.org/r/523124
[10:19:44] (CR) Tarrow: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/523124 (owner: Tarrow)
[10:19:46] Operations, Traffic: Wikipedia is unavailable on Symbian phone's browsers - https://phabricator.wikimedia.org/T227828 (ema) p:Triage→Normal
[10:23:37] akosiaris: thank you :-]
[10:24:03] (Abandoned) Jbond: wdqs: temp ban user agent [puppet] - https://gerrit.wikimedia.org/r/518696 (owner: Jbond)
[10:26:07] (PS4) Jbond: ipmi - pxe: Ensure ipmi is not overriding the boot order [puppet] - https://gerrit.wikimedia.org/r/517694
[10:26:52] (CR) Jbond: [C: +2] ipmi - pxe: Ensure ipmi is not overriding the boot order [puppet] - https://gerrit.wikimedia.org/r/517694 (owner: Jbond)
[10:27:12] (PS2) Muehlenhoff: Remove Diamond from standard/wdqs fixtures [puppet] - https://gerrit.wikimedia.org/r/523117
[10:28:00] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/523127 (https://phabricator.wikimedia.org/T128546)
[10:28:44] (CR) Muehlenhoff: [C: +2] Remove Diamond from standard/wdqs fixtures [puppet] - https://gerrit.wikimedia.org/r/523117 (owner: Muehlenhoff)
[10:28:57] (CR) Volans: "May I suggest to integrate some info here too?" [puppet] - https://gerrit.wikimedia.org/r/517694 (owner: Jbond)
[10:29:53] (PS1) Urbanecm: Enable partial blocks on the Finnish Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/523128 (https://phabricator.wikimedia.org/T228008)
[10:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1030).
[10:30:20] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/523127 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:31:20] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/523127 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:31:35] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/523127 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:32:12] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, with some nitpicks." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: Jbond)
[10:34:01] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:523127| Bumping portals to master (T128546)]] (duration: 00m 56s)
[10:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:07] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:34:16] (PS1) Muehlenhoff: profile::grafana::production: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/523129
[10:34:52] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:523127| Bumping portals to master (T128546)]] (duration: 00m 50s)
[10:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:48] (CR) Arturo Borrero Gonzalez: [C: -1] "PCC in Toolforge fails with this change: https://puppet-compiler.wmflabs.org/compiler1002/17372/" [puppet] - https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: Jbond)
[10:36:11] (CR) Muehlenhoff: [C: +1] "Two typos, looks good to me" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[10:38:33] (PS1) Ema: ATS: log origin server hostname and Backend-Timing [puppet] - https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668)
[10:38:35] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/service-checker] (stretch) - https://gerrit.wikimedia.org/r/522367 (owner: Hashar)
[10:41:59] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[10:44:43] Operations, Traffic: ATS: log mode cannot depend on log filters being configured - https://phabricator.wikimedia.org/T224397 (Vgutierrez) Open→Resolved a:Vgutierrez
[10:46:10] (PS1) Fsero: helmfile,k8s: adding calico-policy into deploy* for manage it in code [puppet] - https://gerrit.wikimedia.org/r/523132
[10:50:56] (CR) Fsero: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1001/17373/deploy1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/523132 (owner: Fsero)
[10:50:58] (CR) Jbond: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/517694 (owner: Jbond)
[10:51:17] thanks jbond42! :)
[10:52:12] !log installing ldap-replica200[12] (T227778)
[10:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:18] T227778: Create an LDAP replica in codfw (using LVS) - https://phabricator.wikimedia.org/T227778
[10:52:58] np :D
[10:59:11] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.51 ms
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1100).
[11:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:13] Let's do the needful
[11:00:48] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/523006 (https://phabricator.wikimedia.org/T216406) (owner: Urbanecm)
[11:00:53] I might put up another change for this SWAT
[11:00:59] PROBLEM - Docker registry HTTPS interface on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1002.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Docker
[11:01:04] Urbanecm: don’t close it after you’re done, please :)
[11:01:12] backports are +2'ed to give time for CI
[11:01:13] Lucas_WMDE, ack
[11:02:44] (Merged) jenkins-bot: Create image-reviewer for commonswiki with same rights as Image-reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/523006 (https://phabricator.wikimedia.org/T216406) (owner: Urbanecm)
[11:03:02] (CR) jenkins-bot: Create image-reviewer for commonswiki with same rights as Image-reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/523006 (https://phabricator.wikimedia.org/T216406) (owner: Urbanecm)
[11:03:28] jan_drewniak, there are uncommitted changes to portals submodule on deploy1001
[11:06:08] ACKNOWLEDGEMENT - Docker registry HTTPS interface on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1002.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.091 second response time Fsero know while new swift container is getting populated https://wikitech.wikimedia.org/wiki/Docker
[11:06:28] (PS4) Jbond: hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779)
[11:08:04] (PS2) Urbanecm: Enable WikiLove and SandboxLink on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523003 (https://phabricator.wikimedia.org/T227970)
[11:08:10] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/523003 (https://phabricator.wikimedia.org/T227970) (owner: Urbanecm)
[11:08:27] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[:gerrit:523006|Create image-reviewer for commonswiki with same rights as Image-reviewer]] (T216406) (duration: 00m 52s)
[11:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:32] T216406: Rename `Image-reviewer` to `image-reviewer`, then migrate all its members - https://phabricator.wikimedia.org/T216406
[11:09:30] (Merged) jenkins-bot: Enable WikiLove and SandboxLink on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523003 (https://phabricator.wikimedia.org/T227970) (owner: Urbanecm)
[11:09:45] (CR) jenkins-bot: Enable WikiLove and SandboxLink on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523003 (https://phabricator.wikimedia.org/T227970) (owner: Urbanecm)
[11:11:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Disallow admins to grant or revoke image reviewer due to migration (T216406) (duration: 00m 50s)
[11:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:49] (PS1) Urbanecm: Disallow admins to grant or remove image reviewer group [mediawiki-config] - https://gerrit.wikimedia.org/r/523137 (https://phabricator.wikimedia.org/T216406)
[11:12:03] (PS2) Urbanecm: Disallow admins to grant or remove image reviewer group [mediawiki-config] - https://gerrit.wikimedia.org/r/523137 (https://phabricator.wikimedia.org/T216406)
[11:12:13] (CR) Urbanecm: [C: +2] "SWAT, already deployed" [mediawiki-config] - https://gerrit.wikimedia.org/r/523137 (https://phabricator.wikimedia.org/T216406) (owner: Urbanecm)
[11:13:16] (Merged) jenkins-bot: Disallow admins to grant or remove image reviewer group [mediawiki-config] - https://gerrit.wikimedia.org/r/523137 (https://phabricator.wikimedia.org/T216406) (owner: Urbanecm)
[11:13:24] !log Running mwscript migrateUserGroup.php --wiki=commonswiki Image-reviewer image-reviewer for T216406
[11:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:36] (CR) jenkins-bot: Disallow admins to grant or remove image reviewer group [mediawiki-config] - https://gerrit.wikimedia.org/r/523137 (https://phabricator.wikimedia.org/T216406) (owner: Urbanecm)
[11:15:08] !log Running mwscript extensions/WikimediaMaintenance/createExtensionTables.php sqwiki wikilove for T227970
[11:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:13] T227970: Activate WikiLove and SandboxLink extensions for SqWiki - https://phabricator.wikimedia.org/T227970
[11:16:55] (PS5) Jbond: hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779)
[11:17:44] (CR) jerkins-bot: [V: -1] hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: Jbond)
[11:18:07] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:19:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:523003|Enable WikiLove and SandboxLink on sqwiki]] (T227970) (duration: 00m 51s)
[11:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:27] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/522987 (https://phabricator.wikimedia.org/T227980) (owner: Urbanecm)
[11:19:37] (PS6) Jbond: hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779)
[11:22:24] !log
urbanecm@deploy1001 Synchronized php-1.34.0-wmf.13/includes/Title.php: SWAT: [[:gerrit:522871|When title contains only slashes, Title::getRootText() shouldnt return false]] (T227816) (duration: 00m 51s) [11:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:45] T227816: Fatal error from page views with invalid titles (instead of "Bad title" message) - https://phabricator.wikimedia.org/T227816 [11:23:07] !log installing python-django security updates on jessie [11:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:37] jan_drewniak, could you please resolve the uncommitted changes thing? [11:23:51] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.71 ms [11:24:11] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.13/includes/libs/http/MultiHttpClient.php: SWAT: [[:gerrit:522951|Raise default reqTimeout in MultiHttpClient]] (T226979) (duration: 00m 51s) [11:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:16] T226979: Increase curl timeout for importImages.php - https://phabricator.wikimedia.org/T226979 [11:24:45] (03PS4) 10Urbanecm: Move private and fishbowl overrides from groupOverrides to groupOverrides2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522987 (https://phabricator.wikimedia.org/T227980) [11:24:53] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522987 (https://phabricator.wikimedia.org/T227980) (owner: 10Urbanecm) [11:25:59] (03PS8) 10Urbanecm: Rename `Image-reviewer` to `image-reviewer` for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) [11:26:13] (03Merged) 10jenkins-bot: Move private and fishbowl overrides from groupOverrides to groupOverrides2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522987 (https://phabricator.wikimedia.org/T227980) (owner: 10Urbanecm) [11:26:20] 
10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Urbanecm) [11:26:47] (03CR) 10jenkins-bot: Move private and fishbowl overrides from groupOverrides to groupOverrides2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522987 (https://phabricator.wikimedia.org/T227980) (owner: 10Urbanecm) [11:26:59] (03PS2) 10Urbanecm: Enable partial blocks on the Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523128 (https://phabricator.wikimedia.org/T228008) [11:27:05] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523128 (https://phabricator.wikimedia.org/T228008) (owner: 10Urbanecm) [11:28:16] (03Merged) 10jenkins-bot: Enable partial blocks on the Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523128 (https://phabricator.wikimedia.org/T228008) (owner: 10Urbanecm) [11:28:41] (03CR) 10jenkins-bot: Enable partial blocks on the Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523128 (https://phabricator.wikimedia.org/T228008) (owner: 10Urbanecm) [11:28:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:522987|Move private and fishbowl overrides from groupOverrides to groupOverrides2]] (T227980) (duration: 00m 51s) [11:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:56] T227980: Make it possible to close private/fishbowl wikis - https://phabricator.wikimedia.org/T227980 [11:31:30] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:523128|Enable partial blocks on the Finnish Wikipedia]] (T228008) (duration: 00m 51s) [11:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:36] T228008: Enable partial blocks on the Finnish Wikipedia - 
https://phabricator.wikimedia.org/T228008 [11:31:48] jan_drewniak, created T228031 for that problem, seems to also affect other clones [11:31:49] T228031: Upon clonning operations/mediawiki-config, uncommited changes are in portals submodule - https://phabricator.wikimedia.org/T228031 [11:32:06] (03PS9) 10Urbanecm: Rename `Image-reviewer` to `image-reviewer` for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) [11:32:18] Lucas_WMDE, I'm "done" [11:32:27] need a script to finish, then I'll be able to deploy the last change [11:32:46] so once you're done with your change(s), please do not close the window [11:36:00] ok [11:36:01] (03CR) 10Jbond: "PCC WMCS (noop): https://puppet-compiler.wmflabs.org/compiler1001/17376/" [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [11:36:16] I haven’t even uploaded the changes yet (about to), and haven’t yet gotten feedback from the feature owner [11:36:23] so I probably won’t deploy anything after all [11:36:24] aha :) [11:36:30] perhaps this evening (“Morning” SWAT) [11:36:44] in that case, I'll just wait for the script to complete and then finish :D [11:37:09] (03PS1) 10Lucas Werkmeister (WMDE): Define settings for Citoid+Wikibase integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523139 [11:37:11] (03PS1) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['enableRefTabs'] in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523140 [11:37:13] (03PS1) 10Lucas Werkmeister (WMDE): Configure Citoid+Wikibase integration on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 [11:39:09] although, perhaps I could do https://gerrit.wikimedia.org/r/522125… Amir1 do you want to quickly review that one? 
[11:39:09] (03PS10) 10Urbanecm: Rename `Image-reviewer` to `image-reviewer` for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) [11:39:29] * Amir1 looking [11:39:30] (that’s https://gerrit.wikimedia.org/r/522125 for dumb IRC client who think the … is part of the link, like mine) [11:40:13] same here :D [11:40:22] Lucas_WMDE, should I wait for your deploy, or deploy my last change now? [11:40:27] (the script just finished) [11:40:34] Urbanecm: go ahead first, just don’t close the window yet [11:40:37] okay [11:40:50] (03PS11) 10Urbanecm: Rename `Image-reviewer` to `image-reviewer` for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) [11:40:56] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [11:41:12] (03CR) 10Ladsgroup: Specify $wmgWBRepoConceptBaseUri again (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:42:02] (03Merged) 10jenkins-bot: Rename `Image-reviewer` to `image-reviewer` for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [11:42:17] (03CR) 10jenkins-bot: Rename `Image-reviewer` to `image-reviewer` for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [11:44:39] (03CR) 10Lucas Werkmeister (WMDE): Specify $wmgWBRepoConceptBaseUri again (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:45:51] (03PS1) 10Hashar: ci:master: further tweak disk check filter [puppet] - 10https://gerrit.wikimedia.org/r/523142 [11:45:56] (03CR) 10Ladsgroup: 
Specify $wmgWBRepoConceptBaseUri again (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:46:10] (03PS2) 10Hashar: ci:master: further tweak disk check filter [puppet] - 10https://gerrit.wikimedia.org/r/523142 (https://phabricator.wikimedia.org/T227605) [11:46:54] (03CR) 10Lucas Werkmeister (WMDE): Specify $wmgWBRepoConceptBaseUri again (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:47:33] (03CR) 10Ladsgroup: [C: 03+1] Specify $wmgWBRepoConceptBaseUri again (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:47:53] jouncebot: now [11:47:53] For the next 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1100) [11:48:37] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[:gerrit:520283|Rename `Image-reviewer` to `image-reviewer` for Commons]] (1/2, T216406) (duration: 00m 50s) [11:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:43] T216406: Rename `Image-reviewer` to `image-reviewer`, then migrate all its members - https://phabricator.wikimedia.org/T216406 [11:49:36] (03PS2) 10Lucas Werkmeister (WMDE): Specify $wmgWBRepoConceptBaseUri again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) [11:49:38] (03PS2) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522126 (https://phabricator.wikimedia.org/T225212) [11:49:43] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:520283|Rename `Image-reviewer` to `image-reviewer` for 
Commons]] (2/2, T216406) (duration: 00m 48s) [11:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:29] rebased my patches and added them to the deployment calendar [11:50:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [11:51:49] Urbanecm: are you still deploying? [11:51:51] yes [11:51:52] one last sync [11:51:59] ok [11:52:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Regrant image reviewers on commonswiki the ability to mass upload (T216406) (duration: 00m 50s) [11:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:13] (03PS1) 10Urbanecm: Regrant image-reviewers@commonswiki mass-upload right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523143 (https://phabricator.wikimedia.org/T216406) [11:53:18] Lucas_WMDE, space is clear [11:53:23] ok thanks [11:53:30] I’ll deploy at least the first one, that should be harmless [11:53:35] not sure if enough time for second after that [11:53:50] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523143 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [11:53:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:54:01] you'll see :) [11:54:09] (03PS3) 10Lucas Werkmeister (WMDE): Specify $wmgWBRepoConceptBaseUri again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) [11:54:11] (03CR) 10jenkins-bot: Regrant image-reviewers@commonswiki mass-upload right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523143 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [11:55:00] (03CR) 10Lucas Werkmeister (WMDE): Specify 
$wmgWBRepoConceptBaseUri again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:55:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:55:52] deploy1001 still has the diff on the portals submodule :/ [11:56:03] yes [11:56:04] (03Merged) 10jenkins-bot: Specify $wmgWBRepoConceptBaseUri again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [11:56:08] I've created an UBN task for that [11:56:23] T228031 [11:56:23] T228031: Upon pulling operations/mediawiki-config, uncommited changes are in portals submodule - https://phabricator.wikimedia.org/T228031 [11:57:08] testing briefly on mwdebug1002 [11:57:28] (03CR) 10jenkins-bot: Specify $wmgWBRepoConceptBaseUri again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [12:00:23] (03PS3) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522126 (https://phabricator.wikimedia.org/T225212) [12:00:24] !log Running mwscript initSiteStats.php --wiki=commonswiki --update to update Special:Statistics after a big change (T216406) [12:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:30] T216406: Rename `Image-reviewer` to `image-reviewer`, then migrate all its members - https://phabricator.wikimedia.org/T216406 [12:00:40] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:522125|Specify $wmgWBRepoConceptBaseUri again (T225212)]] (duration: 00m 51s) [12:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:46] T225212: Specify 
$wgWBRepoSettings['conceptBaseUri'] - https://phabricator.wikimedia.org/T225212 [12:01:31] I’m doing the second change as well [12:01:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522126 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [12:02:12] (03Merged) 10jenkins-bot: Specify $wgWBRepoSettings['conceptBaseUri'] again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522126 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [12:02:31] (03CR) 10jenkins-bot: Specify $wgWBRepoSettings['conceptBaseUri'] again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522126 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [12:02:48] testing on mwdebug1002 [12:05:32] everything looks okay, syncing [12:06:21] !log removing myself from cn=tools.admin (currently not used, was mostly historical for debugging some Toollabs issue in the past) [12:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:15] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:522126|Specify $wgWBRepoSettings['conceptBaseUri'] again (T225212)]] (duration: 00m 50s) [12:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:20] T225212: Specify $wgWBRepoSettings['conceptBaseUri'] - https://phabricator.wikimedia.org/T225212 [12:09:23] oh, I think I’m seeing suspicious errors… [12:10:30] hm, only one occurrence so far though (https://logstash.wikimedia.org/goto/1b070cda346747d3896c785dbee0ea6b) [12:10:57] !log installing ldap-replica200[12] (T227778) [12:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:03] T227778: Create an LDAP replica in codfw (using LVS) - https://phabricator.wikimedia.org/T227778 [12:14:14] is it possible to get a php shell on a specific app server? 
I get “mwscript: command not found” on mw1319 [12:14:15] (03PS1) 10Hashar: releases: stop using contint for php, use prod profile [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [12:15:16] Lucas_WMDE: I use mwmaint1002.eqiad.wmnet [12:15:26] hashar: I want to check specifically mw1319 [12:15:35] because that’s where the log message happened [12:16:22] !log update redis on mwlog, pybal-test, maps and rdb* [12:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:31] Lucas_WMDE: I guess you can copy paste the script so ? :-\ [12:16:55] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:17:35] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.58 ms [12:18:23] `sudo -u www-data php7.2 multiversion/MWScript.php shell.php wikidatawiki` seems to work [12:18:31] and everything looks fine on that host now [12:18:51] so I guess I’ll keep checking logstash and, if nothing else happens, ignore the one-off error :/ [12:20:18] !log ladsgroup@mwmaint1002:~$ mwscript maintenance/createAndPromote.php --wiki=testwikidatawiki --force --bureaucrat Ladsgroup [12:20:21] oops, now there’s a message in logstash from that shell.php because it can’t write to my psy shell directory ^^ [12:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:47] Lucas_WMDE: probably want to use: sudo --set-home [12:20:50] or sudo -H [12:23:00] actually, regular mwscript has the same issue [12:23:03] I’ll create a task [12:23:24] (03PS1) 10Hashar: contint: remove php packages [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) [12:23:30] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [12:23:48] (03CR) 10Mvolz: [C: 03+1] "Looks fine to me but admittedly I don't know enough about how 
the config is constructed from this to give that much of an informed opinion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (owner: 10Lucas Werkmeister (WMDE)) [12:23:50] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [12:24:17] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:24:18] (03PS4) 10Gehel: wdqs: update response time check to new prometheus metrics. [puppet] - 10https://gerrit.wikimedia.org/r/522499 [12:25:29] (03CR) 10Gehel: [C: 03+2] wdqs: update response time check to new prometheus metrics. [puppet] - 10https://gerrit.wikimedia.org/r/522499 (owner: 10Gehel) [12:25:47] !log installing openjpeg2 security updates [12:25:50] (03PS1) 10Fsero: k8s,helmfile: added raw chart into releases for being used in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/523149 [12:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:01] (03PS2) 10Fsero: k8s,helmfile: added raw chart into releases for being used in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/523149 [12:27:35] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s,helmfile: added raw chart into releases for being used in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/523149 (owner: 10Fsero) [12:28:37] (03PS1) 10Hashar: contint: apply apt::unattend_upgrade at role level [puppet] - 10https://gerrit.wikimedia.org/r/523150 [12:29:05] (03PS2) 10Hashar: contint: apply apt::unattend_upgrade at role level [puppet] - 10https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) [12:29:08] https://phabricator.wikimedia.org/T228041 if anyone’s interested in the psy shell issue [12:29:49] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.47 ms [12:30:49] logstash looks okay otherwise, so I’ll call that deployment successful [12:31:04] Urbanecm: I 
just realized I didn’t log the end of EU SWAT yet – you’re done, right? [12:31:31] Lucas_WMDE, was investigating why Image-reviewer group didn't disappear from commons even when deleted from IS.php [12:31:41] hm, ok [12:32:01] and since I now know the cause, I'm going to fix the problem [12:32:23] ok [12:32:27] then I’ll hand the SWAT back to you [12:32:52] thanks [12:32:55] (03PS1) 10Urbanecm: Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523151 (https://phabricator.wikimedia.org/T216406) [12:33:24] (03CR) 10Urbanecm: [C: 03+2] Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523151 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [12:34:29] (03Merged) 10jenkins-bot: Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523151 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [12:35:25] !log reimporting OSM data for maps eqiad cluster - T218097 [12:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:31] T218097: [Bug] Some OSM relations didn't become polygons and are not been served through geoshapes service - https://phabricator.wikimedia.org/T218097 [12:36:04] (03CR) 10MSantos: Disable replicate and admin cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/522072 (https://phabricator.wikimedia.org/T215641) (owner: 10MSantos) [12:36:21] (03PS2) 10Gehel: Disable replicate and admin cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/522072 (https://phabricator.wikimedia.org/T215641) (owner: 10MSantos) [12:36:29] (03CR) 10jenkins-bot: Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523151 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [12:36:39] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:08] (03CR) 
10Gehel: [C: 03+2] Disable replicate and admin cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/522072 (https://phabricator.wikimedia.org/T215641) (owner: 10MSantos) [12:42:15] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 86%, RTA = 229.63 ms [12:42:48] (03PS2) 10Hashar: releases: stop using contint for php, use prod profile [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [12:42:50] (03PS2) 10Hashar: contint: remove php packages [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) [12:42:52] (03PS3) 10Hashar: contint: apply apt::unattend_upgrade at role level [puppet] - 10https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) [12:43:07] (03PS1) 10Urbanecm: Revert "Delete Image-reviewer group from commonswiki for good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523154 [12:43:22] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Delete Image-reviewer group from commonswiki for good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523154 (owner: 10Urbanecm) [12:43:34] PROBLEM - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [12:43:55] ^that's me, silencing now [12:44:00] ack [12:44:01] tx [12:44:04] tx [12:44:13] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [12:44:41] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [12:45:31] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:46:11] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:46:37] actually, this isn't expected, looks like our procedure needs more work [12:46:45] still, we're on it with mateus [12:47:46] (03CR) 10jenkins-bot: Revert "Delete Image-reviewer group from commonswiki for good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523154 (owner: 10Urbanecm) [12:50:12] (03CR) 10Lucas Werkmeister (WMDE): "> Beta will also need a configuration message which I'll need admin privileges for to edit on beta (Username: Mvolz). I've only got them o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523141 (owner: 10Lucas Werkmeister (WMDE)) [12:50:39] (03PS1) 10Urbanecm: Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523157 (https://phabricator.wikimedia.org/T216406) [12:50:41] !log restarting kartotherian on maps1002 [12:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:02] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10CDanis) 05Resolved→03Open [12:51:31] (03CR) 10CDanis: [C: 03+2] conftool: add support for --version to all executables [software/conftool] - 10https://gerrit.wikimedia.org/r/522235 (owner: 10CDanis) [12:51:33] PROBLEM - kartotherian endpoints health on maps1002 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin 
marker with an icon) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received https://wikitech.wikimedia.org/wiki/Servi [12:51:33] rtotherian [12:51:43] (03CR) 10Urbanecm: [C: 03+2] Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523157 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [12:51:53] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-upload&var-status_type=5 [12:51:59] Urbanecm: sorry I just read the message about uncommitted changes, is that still a problem? [12:52:09] jan_drewniak, yes [12:52:19] log in to deploy1001 and run git status in /srv/mediawiki-staging [12:52:34] (03Merged) 10jenkins-bot: Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523157 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [12:52:51] (03CR) 10jenkins-bot: Delete Image-reviewer group from commonswiki for good [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523157 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [12:52:55] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: Delete Image-reviewer group from commonswiki for good (T216406) (duration: 00m 51s) [12:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:02] T216406: Rename `Image-reviewer` to `image-reviewer`, then migrate all its members - https://phabricator.wikimedia.org/T216406 [12:53:03] the upload 5xx are maps btw, likely the kartotherian thing cc gehel [12:53:51] godog: yep, definitely me, looks like CPU spiked like crazy when depooling 
maps1004, still working on it [12:54:07] kk, thanks for the update gehel [12:54:08] * akosiaris around btw [12:54:27] not keen on moving traffic to codfw since it looks load related [12:54:32] (03Merged) 10jenkins-bot: conftool: add support for --version to all executables [software/conftool] - 10https://gerrit.wikimedia.org/r/522235 (owner: 10CDanis) [12:54:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [12:54:46] (03CR) 10CDanis: [C: 03+2] nrpe: $critical is a boolean, NOT a string! 😤 [puppet] - 10https://gerrit.wikimedia.org/r/522992 (https://phabricator.wikimedia.org/T113783) (owner: 10CDanis) [12:54:54] !log shutting down tilerator on maps eqiad to free some CPU - [12:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:05] !log shutting down tilerator on maps eqiad to free some CPU - T225713 [12:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:10] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [12:55:25] (03PS6) 10CDanis: nrpe: $critical is a boolean, NOT a string! 😤 [puppet] - 10https://gerrit.wikimedia.org/r/522992 (https://phabricator.wikimedia.org/T113783) [12:56:07] (03CR) 10Elukey: profile::hue|swap: use new ldap hiera configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [12:56:08] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:57:11] PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:57:32] (03PS3) 10Elukey: profile::hue|swap: use new ldap hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) [12:58:41] PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:59:11] PROBLEM - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:59:19] !log re-enabling kartotherian codfw - T225713 [12:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:32] 10Operations, 10serviceops, 10PHP 7.2 support: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538 (10jijiki) Removing kafka1018 didn't fix the problem, still looking [12:59:48] !log depooling kartotherian eqiad - T225713 [12:59:49] Urbanecm: seems like I forgot to do "git submodule update portals" - so just did that. It's ok that I run the scap sync now for that right? [12:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:12] jan_drewniak, AFAICS, no one should be deploying, so yes, that should be fine IMO. 
[13:00:28] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 5.227 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:01:17] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:01:43] RECOVERY - kartotherian endpoints health on maps1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:01:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [13:01:52] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:523127| Bumping portals to master (T128546)]] (duration: 00m 50s) [13:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:57] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [13:02:01] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:02:04] RECOVERY - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [13:02:07] RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational [13:02:35] maps seems to be recovering now that traffic is sent to codfw, still not sure why the load on eqiad climbed that much, it does not make sense yet to me [13:02:42] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:523127| Bumping portals to master (T128546)]] (duration: 00m 50s) [13:02:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:41] thanks jan_drewniak, looks sane to me now [13:03:48] Urbanecm: ok re-deployed that, sorry for that! [13:04:09] 10Operations, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10Bstorm) @MoritzMuehlenhoff How were the Cloud NFS servers handled? They won't remove the diamond software unless told to I imagine. I don't see that here, though? [13:04:45] 10Operations, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10Bstorm) I was on vacation last week, so I wasn't following the code reviews. [13:04:57] (03CR) 10Hashar: "Puppet compiler fails and I have filled T228047" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [13:06:06] thanks jan_drewniak [13:06:11] (03PS1) 10Hashar: Test compiler for releases1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/523161 (https://phabricator.wikimedia.org/T228047) [13:06:45] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523161 (https://phabricator.wikimedia.org/T228047) (owner: 10Hashar) [13:06:50] 10Operations, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10MoritzMuehlenhoff) It got removed from all production hosts (i.e. including cloudstore*) in fcd6990165c7ec8922a531d11782e21f1a5de04f and made specific to Cloud VPS instances with 3afb8303f164ced695dd597... [13:10:09] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:10:27] 10Operations, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10Bstorm) Thank you!! 
[13:10:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-upload&var-status_type=5 [13:12:10] (03PS3) 10Muehlenhoff: Create two LDAP replicas in codfw [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) [13:14:29] 10Operations, 10Continuous-Integration-Infrastructure, 10puppet-compiler, 10Patch-For-Review: puppet compiler fails on releases1001.eqiad.wmnet due to lack of Service[bacula-director] - https://phabricator.wikimedia.org/T228047 (10hashar) [13:15:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [13:16:36] !log repooling maps eqiad - T218097 [13:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:43] T218097: [Bug] Some OSM relations didn't become polygons and are not been served through geoshapes service - https://phabricator.wikimedia.org/T218097 [13:17:34] (03PS10) 10CDanis: WIP dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523013 [13:22:09] (03PS4) 10Alexandros Kosiaris: Update my obsolete YubiKey-stored SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/519941 (https://phabricator.wikimedia.org/T227638) (owner: 10Aaron Schulz) [13:22:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] Update my obsolete YubiKey-stored SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/519941 (https://phabricator.wikimedia.org/T227638) (owner: 10Aaron Schulz) [13:23:13] !log Running mwscript importImages.php --wiki=commonswiki --user=Meisam /home/urbanecm/T223052 [13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:07] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Apply updated YubiKey SSH keys for 
aaron - https://phabricator.wikimedia.org/T227638 (10akosiaris) 05Open→03Resolved a:03akosiaris Thanks for filling the task. Keys double checked, change merged. Should have propagated to the entirety of the flee... [13:27:35] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:33:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:53] (03PS1) 10Jbond: kerberos/rsync_secrets_file: add file [labs/private] - 10https://gerrit.wikimedia.org/r/523165 [13:34:44] ACKNOWLEDGEMENT - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 98 probes of 437 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map CDanis https://phabricator.wikimedia.org/T228015 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [13:35:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [13:35:54] ACKNOWLEDGEMENT - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% CDanis https://phabricator.wikimedia.org/T227967 [13:36:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] kerberos/rsync_secrets_file: add file [labs/private] - 10https://gerrit.wikimedia.org/r/523165 (owner: 10Jbond) [13:37:01] PROBLEM - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:37:54] (03PS1) 10Alexandros Kosiaris: Add jakob to deployers [puppet] - 10https://gerrit.wikimedia.org/r/523166 (https://phabricator.wikimedia.org/T227193) [13:39:27] 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10akosiaris) @alaa_wmde. Gentle reminder about generating and posting a separate SSH key... [13:41:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/523166 (https://phabricator.wikimedia.org/T227193) (owner: 10Alexandros Kosiaris) [13:41:36] godog, could you please have a look at T226937#5330884? [13:41:37] T226937: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 [13:41:54] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10akosiaris) 05Open→03Stalled p:05Normal→03Low Setting stalled and low priority per comments above. @sbassett feel free t... [13:41:57] 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10akosiaris) [13:46:07] (03PS2) 10Ema: ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) [13:46:09] (03PS1) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [13:46:21] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10akosiaris) @DLynch Could you please look at @Nuria's comment above ? Thank you. 
[13:46:50] (03CR) 10jerkins-bot: [V: 04-1] ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [13:47:09] (03PS3) 10Ema: ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) [13:47:18] (03CR) 10Bstorm: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [13:48:41] Urbanecm: yes, on my to look at list [13:48:47] thanks [13:48:59] 10Operations, 10LDAP: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10MoritzMuehlenhoff) p:05Triage→03Normal [13:50:17] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.52 ms [13:50:39] (03PS4) 10Ema: ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) [13:50:41] (03PS2) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [13:51:02] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10akosiaris) Adding @Nuria as the manager for analytics clusters. A comment I have is that maybe [[ https://turnilo.wikimedia.org | turnilo ]] or `analytics-users` wo... 
[13:51:18] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10akosiaris) p:05Triage→03Normal [13:51:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add jakob to deployers [puppet] - 10https://gerrit.wikimedia.org/r/523166 (https://phabricator.wikimedia.org/T227193) (owner: 10Alexandros Kosiaris) [13:52:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] Create two LDAP replicas in codfw [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [13:53:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10akosiaris) 05Open→03Resolved a:03akosiaris Task has been opened for the required amount of days... [13:54:08] (03PS2) 10Elukey: role::swap: add profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/523090 (https://phabricator.wikimedia.org/T170826) [13:54:45] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab] (notebook): (no justification provided) [13:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:00] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab] (notebook): (no justification provided) (duration: 00m 15s) [13:55:01] (03CR) 10Elukey: [C: 03+2] role::swap: add profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/523090 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [13:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:38] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab] (notebook): (no justification provided) [13:55:43] !log enable profile::base::firewall on notebook100[3,4] [13:55:44] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab] (notebook): (no justification provided) (duration: 
00m 06s) [13:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:05] gehel: are you aware of tilerator on maps1001? [13:59:44] 10Operations, 10Continuous-Integration-Infrastructure, 10puppet-compiler, 10Patch-For-Review: puppet compiler fails on releases1001.eqiad.wmnet due to lack of Service[bacula-director] - https://phabricator.wikimedia.org/T228047 (10akosiaris) 05Open→03Invalid ` Warning: Could not find resource 'Service[... [14:00:21] volans: yep, sorry, failed to downtime that one [14:00:45] no prob, just making sure was not overlooked ;) [14:01:09] ACKNOWLEDGEMENT - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel OSM reimport in progress - https://phabricator.wikimedia.org/T218097 [14:01:14] in times of stress, always better to check twice! [14:01:33] :) [14:03:42] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10DLynch) @Nuria I am indeed a permanent employee of the foundation, and believe that I have a NDA on file. (Sorry, I was taking some vacation last week, and missed the... 
[14:04:52] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab] (notebook): (no justification provided) [14:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:57] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab] (notebook): (no justification provided) (duration: 00m 05s) [14:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:09] (03PS4) 10Muehlenhoff: Create two LDAP replicas in codfw [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) [14:05:13] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:31] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Mailing-lists: rename mailing list "ri-team" to "product-infrastructure" - https://phabricator.wikimedia.org/T227698 (10akosiaris) [14:06:16] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab] (notebook): (no justification provided) [14:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:22] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab] (notebook): (no justification provided) (duration: 00m 07s) [14:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:32] (03CR) 10Muehlenhoff: [C: 03+2] Create two LDAP replicas in codfw [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [14:07:49] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab] (notebook): (no justification provided) [14:07:50] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab] (notebook): (no justification provided) (duration: 00m 01s) [14:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:56] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab] (notebook): (no justification provided) [14:07:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:57] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab] (notebook): (no justification provided) (duration: 00m 01s) [14:07:59] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab] (notebook): (no justification provided) [14:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:06] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab] (notebook): (no justification provided) (duration: 00m 06s) [14:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:37] (03PS5) 10Ema: ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) [14:09:39] (03PS3) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [14:09:49] (03PS1) 10Alexandros Kosiaris: Rename ri-team lists to product-infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/523175 (https://phabricator.wikimedia.org/T227698) [14:10:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] Rename ri-team lists to product-infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/523175 (https://phabricator.wikimedia.org/T227698) (owner: 10Alexandros Kosiaris) [14:13:53] PROBLEM - puppet last run on ldap-replica2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/acmecerts/ldap] [14:15:53] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Mailing-lists, 10Patch-For-Review: rename mailing list "ri-team" to "product-infrastructure" - https://phabricator.wikimedia.org/T227698 (10akosiaris) 05Open→03Resolved a:03akosiaris List renamed, resolving. [14:16:41] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.48 ms [14:18:39] (03PS6) 10Ema: ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) [14:18:41] (03PS4) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [14:27:14] (03PS1) 10Ottomata: Produce revision-visibility-change event to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523180 (https://phabricator.wikimedia.org/T211248) [14:28:52] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:30] 10Operations, 10Wikimedia-Mailing-lists: Subscribe Urbanecm to ops@lists.wikimedia.org - https://phabricator.wikimedia.org/T228061 (10Urbanecm) [14:33:13] (03CR) 10Ottomata: [C: 03+1] profile::hue|swap: use new ldap hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [14:33:56] (03PS7) 10Ema: ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) [14:34:24] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.61 ms [14:36:00] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10fgiunchedi) >>! 
In T218544#5329233, @Cmjohnson wrote: > This is a dell server, I will try and put in a ticket with Dell but all h/w is showing that there isn't... [14:41:44] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:15] (03PS4) 10Elukey: profile::hue|swap: use new ldap hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) [14:43:24] (03PS1) 10Muehlenhoff: Allow the new LDAP replicas in codfe to access acmechief [puppet] - 10https://gerrit.wikimedia.org/r/523190 [14:43:44] (03CR) 10Elukey: [C: 03+2] profile::hue|swap: use new ldap hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/523111 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [14:46:51] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523190 (owner: 10Muehlenhoff) [14:47:28] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 86%, RTA = 229.87 ms [14:47:42] 10Operations, 10Analytics, 10LDAP-Access-Requests: Add Jan Dittrich to the ldap/wmde and ldap/nda groups - https://phabricator.wikimedia.org/T227774 (10akosiaris) 05Open→03Resolved a:03akosiaris WMDE-jand is already part of `wmde` ldap group. To be added to the `nda` ldap group, having signed the NDA i... [14:49:30] (03PS2) 10Muehlenhoff: Allow the new LDAP replicas in codfe to access acmechief [puppet] - 10https://gerrit.wikimedia.org/r/523190 [14:50:19] 10Operations, 10Wikimedia-Mailing-lists: Subscribe Urbanecm to ops@lists.wikimedia.org - https://phabricator.wikimedia.org/T228061 (10akosiaris) 05Open→03Resolved a:03akosiaris I 've just subscribed you. 
Resolving, feel free to reopen if something is amiss [14:50:56] 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services), 10Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (10akosiaris) p:05Triage→03Normal [14:51:09] 10Operations, 10ops-eqiad: Degraded RAID on analytics1032 - https://phabricator.wikimedia.org/T227940 (10akosiaris) p:05Triage→03Normal [14:51:20] (03CR) 10Muehlenhoff: [C: 03+2] Allow the new LDAP replicas in codfe to access acmechief [puppet] - 10https://gerrit.wikimedia.org/r/523190 (owner: 10Muehlenhoff) [14:54:54] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:21] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) [15:01:38] RECOVERY - puppet last run on ldap-replica2001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:04:31] (03PS1) 10EBernhardson: Increase services proxy connect timeout to 5s [puppet] - 10https://gerrit.wikimedia.org/r/523194 (https://phabricator.wikimedia.org/T228063) [15:05:22] 10Operations, 10Analytics, 10LDAP-Access-Requests: Add Jan Dittrich to the ldap/wmde and ldap/nda groups - https://phabricator.wikimedia.org/T227774 (10Jan_Dittrich) thanks, that worked. [15:06:01] 10Operations, 10Wikimedia-Mailing-lists: Subscribe Urbanecm to ops@lists.wikimedia.org - https://phabricator.wikimedia.org/T228061 (10Urbanecm) 05Resolved→03Open I'm sorry, should've noted that explicitly in the task's description. Looks you subscribed my WMF email (murbanec-ctr@wikimedia.org). However, I... [15:06:20] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 86%, RTA = 229.99 ms [15:10:15] 10Operations, 10Wikimedia-Mailing-lists: Subscribe Urbanecm to ops@lists.wikimedia.org - https://phabricator.wikimedia.org/T228061 (10akosiaris) Sure, just done.
I also removed the @wikimedia.org one, I hope that's what you wanted. [15:11:01] 10Operations, 10Wikimedia-Mailing-lists: Subscribe Urbanecm to ops@lists.wikimedia.org - https://phabricator.wikimedia.org/T228061 (10Urbanecm) 05Open→03Resolved Yes, thanks! [15:16:09] (03CR) 10Andrew Bogott: [C: 03+1] "> I'd vote against a big default on the class itself (vs. disabling by" [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [15:17:43] (03PS1) 10Arturo Borrero Gonzalez: sssd: fix path for pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/523197 [15:18:40] (03CR) 10Andrew Bogott: [C: 03+1] "I wonder why this worked..." [puppet] - 10https://gerrit.wikimedia.org/r/523197 (owner: 10Arturo Borrero Gonzalez) [15:19:43] (03PS1) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [15:20:49] (03CR) 10jerkins-bot: [V: 04-1] deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [15:22:28] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/523197 (owner: 10Arturo Borrero Gonzalez) [15:22:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sssd: fix path for pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/523197 (owner: 10Arturo Borrero Gonzalez) [15:25:41] (03PS2) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [15:26:41] (03CR) 10jerkins-bot: [V: 04-1] deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [15:27:49] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable WelcomeSurvey for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523202 (https://phabricator.wikimedia.org/T226221) [15:35:56] (03PS3) 
10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [15:36:23] (03CR) 10jerkins-bot: [V: 04-1] deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [15:53:11] (03PS8) 10Ema: ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) [15:53:13] (03PS1) 10Ema: ATS: add 'notvarnishcheck' log filter to labs configuration [puppet] - 10https://gerrit.wikimedia.org/r/523209 [15:58:45] (03CR) 10Ema: [C: 03+2] ATS: log origin server hostname and Backend-Timing [puppet] - 10https://gerrit.wikimedia.org/r/523130 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [15:58:59] (03CR) 10Ema: [C: 03+2] ATS: add 'notvarnishcheck' log filter to labs configuration [puppet] - 10https://gerrit.wikimedia.org/r/523209 (owner: 10Ema) [15:59:05] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:59:35] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:59:54] (03PS4) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [16:00:51] (03PS1) 10Urbanecm: Revert "Delete Image-reviewer group from commonswiki for good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523214 (https://phabricator.wikimedia.org/T228073) [16:00:53] (03CR) 10jerkins-bot: [V: 04-1] deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [16:01:18] (03PS2) 10Urbanecm: Revert "Delete Image-reviewer group from commonswiki for good" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/523214 (https://phabricator.wikimedia.org/T228073) [16:02:45] (03PS5) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [16:03:26] (03CR) 10jerkins-bot: [V: 04-1] deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [16:04:49] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 230.35 ms [16:05:02] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: kubeadm: now using external etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) [16:05:49] (03PS6) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [16:06:58] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: kubeadm: now using external etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) [16:08:27] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.491 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:12:15] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:45] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: kubeadm: now using external etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) [16:14:29] (03PS1) 10Fsero: deploy,helmfile: added fake secrets data for admin_services [labs/private] - 10https://gerrit.wikimedia.org/r/523222 [16:15:15] (03PS4) 10Arturo Borrero Gonzalez: toolforge: k8s: kubeadm: now using external etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) [16:15:18] (03CR) 10Alexandros Kosiaris: 
[C: 03+1] Bump Termbox Staging to 2019-07-12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/523124 (owner: 10Tarrow) [16:15:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] "@tarrow, you should have already access to +2 and merge, let me know if you don't" [deployment-charts] - 10https://gerrit.wikimedia.org/r/523124 (owner: 10Tarrow) [16:15:47] (03CR) 10Fsero: [V: 03+2 C: 03+2] deploy,helmfile: added fake secrets data for admin_services [labs/private] - 10https://gerrit.wikimedia.org/r/523222 (owner: 10Fsero) [16:20:58] (03CR) 10Fsero: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1001/17387/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.pson" [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [16:23:43] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 80%, RTA = 229.75 ms [16:27:59] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10hashar) Have you planned the cloudvirt yet? I guess that is a bit more challenging since instances would have to be moved ahead of time, but I am genuinely interested in seeing whether that improves the b... [16:28:07] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Gehel) Oops, the 3 logs above about maps shoudl have been on T218097 [16:31:13] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:24] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Cmjohnson) Thanks, @godog is there any way you can put some stress on that disk? It's hard for me to justify to Dell that we need a disk replacement when it sho... [16:33:03] (03CR) 10SBassett: [C: 03+1] "people.w.o also seems to have a pretty trusted group of a few hundred wmf, wmde and volunteer users, with 71 public_html dirs." 
[puppet] - 10https://gerrit.wikimedia.org/r/522991 (https://phabricator.wikimedia.org/T224068) (owner: 10Gergő Tisza) [16:36:34] 10Operations, 10Continuous-Integration-Infrastructure, 10puppet-compiler, 10Patch-For-Review: puppet compiler fails on releases1001.eqiad.wmnet due to lack of Service[bacula-director] - https://phabricator.wikimedia.org/T228047 (10hashar) *facepalm* I was debugging using the `production` branch catalog in... [16:36:57] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.53 ms [16:37:40] (03CR) 10Hashar: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/224/releases1001.eqiad.wmnet/change.releases1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [16:37:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of comments inline, LGTM however" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [16:42:36] (03PS5) 10Elukey: Add Ipv6 PTR/AAAA records for an-worker* and an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) [16:43:38] (03PS7) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [16:44:06] (03CR) 10Elukey: [C: 03+2] Add Ipv6 PTR/AAAA records for an-worker* and an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [16:44:14] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10jcrespo) [16:44:17] (03CR) 10Arturo Borrero Gonzalez: "This patch is currently live-hacked into the toolsbeta puppetmaster. 
We apparently need to specify the certs that the k8s control plane wi" [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:44:30] (03PS2) 10Fsero: helmfile,k8s: adding calico-policy into deploy* for manage it in code [puppet] - 10https://gerrit.wikimedia.org/r/523132 [16:44:32] (03PS8) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [16:44:41] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:45:28] (03CR) 10Fsero: "ty for the review, addressed comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [16:46:46] 10Operations, 10Analytics, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10EBernhardson) I currently have three use cases for this functionality: 1) Export bulk data updates for all wikis from an... 
[16:49:32] jouncebot: next [16:49:32] In 0 hour(s) and 10 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1700) [16:49:59] (03CR) 10Ottomata: [C: 03+2] Produce revision-visibility-change event to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523180 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:51:13] (03Merged) 10jenkins-bot: Produce revision-visibility-change event to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523180 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:51:30] (03CR) 10jenkins-bot: Produce revision-visibility-change event to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523180 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:51:43] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) [16:53:41] (03PS1) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [16:55:47] (03PS2) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [16:56:06] (03CR) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:56:42] (03PS2) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [16:56:57] (03PS3) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [16:57:07] (03CR) 10jerkins-bot: [V: 04-1] prometheus: wire up 
prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:57:13] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Produce revision-visibility-change stream to eventgate-main - T211248 (duration: 00m 49s) [16:57:18] (03CR) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:19] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [16:58:20] !log setting labsdb1009/10/11 to performance scaling_governor T225713 [16:58:21] (03PS1) 10RobH: puppetmaster1003 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/523231 (https://phabricator.wikimedia.org/T201342) [16:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:26] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [16:58:29] (03PS1) 10Ottomata: Produce revision-create stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523232 (https://phabricator.wikimedia.org/T211248) [16:58:46] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster1003 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/523231 (https://phabricator.wikimedia.org/T201342) (owner: 10RobH) [17:00:04] gehel and onimisionipe: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1700). 
[17:00:10] (03CR) 10Elukey: prometheus: wire up prometheus-varnishkafka-exporter for deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:00:21] jouncebot: WDQS deployment will be delayed today [17:00:54] Hello, is it already known that channels aren't working anymore on logstash since 12:30? See e.g. https://logstash.wikimedia.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(_source),index:'logstash-*',interval:auto,query:(query_string:(query:'channel:StashEdit')),sort:!('@timestamp',desc)) [17:01:31] (03PS2) 10RobH: puppetmaster1003 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/523231 (https://phabricator.wikimedia.org/T201342) [17:01:54] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster1003 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/523231 (https://phabricator.wikimedia.org/T201342) (owner: 10RobH) [17:02:15] And it seems not to be about channels only [17:02:33] (03PS3) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [17:03:17] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:30] hrmm [17:03:31] (03PS3) 10RobH: puppetmaster1003 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/523231 (https://phabricator.wikimedia.org/T201342) [17:03:54] robh: see eqsin mails [17:04:06] ahh [17:04:25] jynus: thx! 
[17:04:26] it is still not great, but someone is working on it [17:04:35] its oob so as long as its known its fine [17:04:38] =] [17:04:46] i just didnt realize that was what that email thread was for [17:05:01] I asked too, to be fair [17:05:03] https://logstash.wikimedia.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(_source),index:'logstash-*',interval:auto,query:(query_string:(query:'type:mediawiki')),sort:!('@timestamp',desc)) wtf? [17:05:04] (03CR) 10RobH: [C: 03+2] puppetmaster1003 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/523231 (https://phabricator.wikimedia.org/T201342) (owner: 10RobH) [17:05:41] Daimona, please use the short url feature, if you can :) [17:06:03] at least I have to copy the url to access it [17:06:10] Urbanecm: Huh, thanks, I've never noticed that little link right there [17:06:11] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) [17:06:19] yw Daimona [17:07:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [17:07:13] Daimona, if you need something, fresh logs seems to be on mwlog1001, so I can give you logentries you need if wanted [17:07:33] No, nothing specific, but thanks anyway :-) [17:07:54] I just wanted to make it sure people know about this, but given the last link, I guess they do [17:08:06] yw [17:08:59] (03CR) 10Bstorm: [C: 03+1] "My only concern is if it will work on all three servers. I can check that myself on the livehacked version if its a 3-node cluster. 
Will" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [17:09:01] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 229.55 ms [17:14:03] (03PS1) 10Kosta Harlan: GrowthExperiments: Remove reference to non-existent feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523233 [17:17:57] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable WelcomeSurvey A/B test for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523202 (https://phabricator.wikimedia.org/T226221) [17:18:01] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523233 (owner: 10Kosta Harlan) [17:19:46] (03CR) 10Urbanecm: [C: 03+1] "LGTM. Note the most of variables in IS.php are sorted alphabetically." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523202 (https://phabricator.wikimedia.org/T226221) (owner: 10Kosta Harlan) [17:19:52] (03CR) 10Cwhite: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [17:22:13] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:21] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) [17:22:46] (03PS1) 10RobH: install parameters for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/523236 (https://phabricator.wikimedia.org/T201342) [17:23:05] jouncebot, next [17:23:05] In 0 hour(s) and 36 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1800) [17:23:59] (03PS2) 10RobH: install parameters for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/523236 (https://phabricator.wikimedia.org/T201342) [17:24:02] 10Operations, 10DC-Ops, 10Office-IT: Request for hard drives - https://phabricator.wikimedia.org/T227800 (10HMarcus) 
05Open→03Resolved Thank you for the quick follow up Papaul, will go ahead and close this out. [17:25:42] (03CR) 10RobH: [C: 03+2] install parameters for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/523236 (https://phabricator.wikimedia.org/T201342) (owner: 10RobH) [17:27:59] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 86%, RTA = 229.49 ms [17:28:22] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) [17:29:19] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) a:05RobH→03jbond Discussed this in chat with @jbond this AM Per that chat I've updated install server files and dns, but not yet installed the system, handing off to him for completion. [17:32:31] 10Operations, 10ops-eqiad, 10Analytics: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (10Cmjohnson) Disks is on it's way [17:34:18] !log downtime mr1-eqsin.oob IPv6 for 20h T227967 [17:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:27] T227967: mr1-eqsin.oob IPv6 connectivity flapping - https://phabricator.wikimedia.org/T227967 [17:42:17] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 3611 MB (2% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:48:53] hey Urbanecm -- is mwmaint1002 the machine you were doing swift image upload testing on? [17:49:09] cdanis, yes [17:49:16] (03CR) 10MSantos: "this patch needs to be abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/508841 (owner: 10Gehel) [17:49:44] Urbanecm: is the disk space alert you, then? :) https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=mwmaint1002&panelId=12&fullscreen&from=now-12h&to=now [17:50:20] cdanis, yes, that's most probably me :( [17:50:32] I'll delete the files as soon as they are in commons [17:50:57] ok! 
just as long as the disk doesn't fill up, like, any further [17:51:03] sorry for all the issues with swift, i haven't had time to dig into it [17:52:05] cdanis, wget finished (just in time :)), deleted one uploaded file, so the free space should be increasing for this timebeing :) [17:52:59] (03CR) 10Muehlenhoff: "The netboot.cfg part is wrong, it's for labpuppetmaster instead of puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/523236 (https://phabricator.wikimedia.org/T201342) (owner: 10RobH) [17:54:26] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Cmjohnson) @godog, no worries about the earlier comment. Dell approved the disk replacement. I will update task once it's been replaced. [17:56:55] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1800). [18:00:04] wdoran and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] I can SWAT today! 
[18:00:14] here [18:00:37] (03PS3) 10Urbanecm: GrowthExperiments: Enable WelcomeSurvey A/B test for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523202 (https://phabricator.wikimedia.org/T226221) (owner: 10Kosta Harlan) [18:00:43] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523202 (https://phabricator.wikimedia.org/T226221) (owner: 10Kosta Harlan) [18:01:08] cdanis: thx [18:01:25] (03PS4) 10Thcipriani: blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) [18:01:26] (03PS2) 10Thcipriani: Blubberoid: enable policy, bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561 [18:01:44] (03Merged) 10jenkins-bot: GrowthExperiments: Enable WelcomeSurvey A/B test for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523202 (https://phabricator.wikimedia.org/T226221) (owner: 10Kosta Harlan) [18:02:00] (03CR) 10jenkins-bot: GrowthExperiments: Enable WelcomeSurvey A/B test for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523202 (https://phabricator.wikimedia.org/T226221) (owner: 10Kosta Harlan) [18:02:46] kostajh, should be live on mwdebug1002 [18:03:20] looking [18:03:59] wdoran, around? [18:04:07] yep [18:04:09] hi [18:04:32] cool! 
[18:05:13] I'm going to +2 your backport, waiting for CI to do its job [18:05:54] great, thanks [18:07:46] !log syncing puppetmaster1001 facts to compiler1001/1002 [18:07:47] Urbanecm: almost done [18:07:51] ack [18:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:02] Urbanecm: I think it's good to go [18:09:09] kostajh, okay, going to sync [18:10:02] 10Operations, 10decommission: Decommission analytics10[28-41] - https://phabricator.wikimedia.org/T227485 (10RobH) a:03elukey [18:11:02] 10Operations, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10RobH) a:05RobH→03wiki_willy @wiki_willy has actually been working on this for weeks now, so assigning this to him. [18:11:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:523202|GrowthExperiments: Enable WelcomeSurvey A/B test for arwiki]] (T226221) (duration: 01m 02s) [18:11:14] (03PS1) 10CDanis: nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 [18:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:18] T226221: Setup Welcome Survey for Arabic Wikipedia - https://phabricator.wikimedia.org/T226221 [18:11:20] kostajh, should be deployed. 
[18:11:28] Urbanecm: thanks [18:11:30] yw [18:11:48] (03CR) 10jerkins-bot: [V: 04-1] nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [18:12:59] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Remove spam mitigations (T200104) (duration: 00m 50s) [18:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:51] (03PS2) 10CDanis: WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 [18:16:21] (03CR) 10Jbond: [C: 03+1] "LGTM just a question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [18:17:21] (03CR) 10CDanis: WIP nrpe: support dashboard_links in nrpe::check_service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [18:17:32] wdoran, patch is merged [18:17:33] PROBLEM - MariaDB Slave Lag: x1 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [18:17:45] Urbanecm: great [18:18:14] wdoran, your patch is on mwdebug1002. Please test and let me know if I can deploy it. [18:18:47] Urbanecm: will do, now [18:18:51] thanks [18:19:08] (03CR) 10Andrew Bogott: "crap, I fixed the bug but not the docs :( Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/522992 (https://phabricator.wikimedia.org/T113783) (owner: 10CDanis) [18:20:19] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10ayounsi) p:05Triage→03High [18:20:21] andrewbogott: no problem <3 [18:21:03] (03CR) 10Jbond: [C: 03+1] "have rechecked and lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [18:21:06] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10CDanis) This is very likely related to {T226937} [18:22:03] jbond42: out of curiosity I'm going to do a little looking through catalogs before submitting [18:22:12] jbond42: you also might appreciate https://phabricator.wikimedia.org/P8744 [18:22:24] which is what i used to validate work on https://gerrit.wikimedia.org/r/c/operations/puppet/+/522992 [18:22:59] cdanis: Logstash is not showing any messages from MediaWIki since 13:00 UTC , 6 hours ago [18:23:04] wdoran, status? [18:23:12] Urbanecm: Logstash isn't able to show any warnings or errors [18:23:16] Krinkle, that was mentioned here before, should be known [18:23:21] which means Scap has no confidence in deployments [18:23:24] and we can't verify reliably [18:23:26] I'd recommend aborting [18:23:35] Urbanecm: I'm afraid it's still broken, we'll have to put it back into our team [18:23:36] thanks [18:23:46] Urbanecm: wait, it was known? why were we deploying? [18:24:15] Krinkle, probably because I didn't realize scap checks with logstash, not with mwlog (which is still working) [18:24:23] wdoran, thanks [18:24:40] Urbanecm: ah okay, good to know. 
so it's from the logstash side [18:24:52] yeah, it's also common for verifiers to use logstash/mwdebug to check for potential regressions and new errors [18:24:55] which is now suspiciously blank [18:25:33] * Urbanecm is going to revert the backport that doesn't work and to close the window [18:26:29] thanks [18:26:40] Urbanecm: where was it mentioned, is there a ticket? [18:26:52] ticket wasn't linked [18:26:57] it was mentioned in this chan [18:27:21] !log Morning SWAT done [18:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:05] Krinkle, https://phabricator.wikimedia.org/P8748 has the relevant conversation [18:32:36] hmm Krinkle was going to do a config change, should I hold? [18:33:30] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Krinkle) [18:33:35] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Krinkle) p:05Triage→03Unbreak! [18:33:41] (03PS1) 10Nuria: Add wikishared DB to databases available in superset [puppet] - 10https://gerrit.wikimedia.org/r/523252 [18:34:01] ottomata: depends on how confident you are :) [18:34:43] (03PS2) 10Nuria: Add wikishared DB to databases available in superset [puppet] - 10https://gerrit.wikimedia.org/r/523252 [18:35:24] ottomata: if it's the patch to enable more event traffic being emitted, that seems to have some potential for regressions as there's a lot of extra code involved with that that will get run from various code paths, which we'd have no visibility on if it causes (indirect) issues. [18:35:56] indeed. [18:36:10] i would rely on logstash there to know if there were problems [18:36:11] will hold.
[18:36:45] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Krinkle) [18:46:24] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Urbanecm) @jcrespo said on #wikimedia-operations " robh: see eqsin mails" when talking about this issue, see P8748. [18:48:26] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10jcrespo) I never talked about this issue, and had not idea why @Urbanecm thoughout I was talking about this while I was having a private conversation with ot... [18:49:25] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Eevans) 05Open→03Resolved All instances bootstrapped, and cleanups in corresponding rack... [18:51:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10wiki_willy) a:03jijiki Assigning to @jijiki for now. Hi Effie - let us know when it would be ok to take this server down to reseat the DIMM, and then assign the task back to @Cmjohnson... [18:52:10] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Urbanecm) I'm sorry, I thought so because said conversation directly followed the report from @Daimona. [18:55:28] (03CR) 10Ottomata: [C: 03+2] Add wikishared DB to databases available in superset [puppet] - 10https://gerrit.wikimedia.org/r/523252 (owner: 10Nuria) [18:59:05] (03PS1) 10Gehel: wdqs: introduced tuned journal options to wdqs2001. 
[puppet] - 10https://gerrit.wikimedia.org/r/523266 [19:00:04] thcipriani and paladox: #bothumor My software never has bugs. It just develops random features. Rise for Gerrit Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T1900). [19:00:19] * thcipriani here! [19:00:23] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Daimona) Huh, when I first reported I thought someone already knew about this. Anyway. Looking at [[https://logstash.wikimedia.org/goto/6c9e93c200ea693b5a295... [19:00:23] * paladox is here [19:02:59] (03PS3) 10Hashar: releases: use php production profile instead of contint [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [19:03:20] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [19:04:39] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Gerrit v2.15.14 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/522133 (owner: 10Paladox) [19:05:48] !log restarting logstash on logstash1008 [19:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:37] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@40d88dc]: Bump gerrit version to 2.15.14 (gerrit2001) [19:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:49] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@40d88dc]: Bump gerrit version to 2.15.14 (gerrit2001) (duration: 00m 12s) [19:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:50] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@40d88dc]: Bump gerrit version to 2.15.14 (cobalt - restart incoming) [19:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:00] !log thcipriani@deploy1001 Finished 
deploy [gerrit/gerrit@40d88dc]: Bump gerrit version to 2.15.14 (cobalt - restart incoming) (duration: 00m 10s) [19:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:34] !log gerrit restart for v2.15.14 [19:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:49] !log gerrit back [19:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:34] anomie: beware that logstash isn't working, so you'll want to tail on mwlog instead to check deployment (and scap won't detect issues from canary boxes) [19:12:53] Krinkle: I'm not doing any deployments today, too many meetings [19:13:16] oh, gerrit just told me you merged a wmf.13 patch [19:13:19] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [19:13:49] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [19:14:14] anomie: I assumed that meant you're deploying it :) [19:14:16] (03PS4) 10Hashar: releases: use php production profile instead of contint [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [19:14:35] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [19:16:46] Ok, I thought it was a master patch. [19:16:50] s/Ok/Oh/ [19:17:16] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Krinkle) >>! In T228089#5334366, @Urbanecm wrote: > @jcrespo said on #wikimedia-operations " robh: see eqsin mails" when talking about this issue, see... 
[19:17:20] (03CR) 10Hashar: "Well I guess we need a specific set since we would miss a few packages such as:" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [19:17:38] (03Abandoned) 10Hashar: releases: use php production profile instead of contint [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [19:17:56] (03Abandoned) 10Hashar: Test compiler for releases1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/523161 (https://phabricator.wikimedia.org/T228047) (owner: 10Hashar) [19:19:05] anomie: I'd consider rolling it out anyway, but holding off given the logstash situation and other on-going issues that lower general confidence and increased confusion. [19:19:14] 10Operations, 10ops-eqiad: Degraded RAID on analytics1032 - https://phabricator.wikimedia.org/T227940 (10wiki_willy) a:03wiki_willy @Cmjohnson - looks like this server is out of warranty and just past the 5yr mark, but is also tied to a refresh order last Q2 in FY19-20 under T204177. Also, seems like it's b... [19:19:31] Krinkle: Will you handle that? I really don't have time today :( [19:19:57] anomie: yeah, I'll revert before the next person deploys unless that person is me deploying it. [19:20:04] Thanks! [19:21:49] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10wiki_willy) a:03Cmjohnson [19:22:36] 10Operations, 10ops-eqiad: (OoW) Degraded RAID on analytics1032 - https://phabricator.wikimedia.org/T227940 (10wiki_willy) [19:24:48] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10Krinkle) Meanwhile, back on topic. Some graphs that have been mentioned in the IRC conversation about this. [Dashboard: kafka-consumer-lag](https://grafana.... 
[19:25:16] (03Restored) 10Hashar: releases: use php production profile instead of contint [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [19:25:24] 10Operations, 10ops-eqiad: (OoW) Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10wiki_willy) a:03elukey [19:26:04] (03CR) 10Bstorm: [C: 03+1] "After messing with it, I see our current config skips client cert auth...and I think it will work with this client cert setup even if we e" [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [19:26:17] (03PS5) 10Bstorm: toolforge: k8s: kubeadm: now using external etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [19:26:27] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@df6322a]: Rename error field in deduplication logs [19:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:42] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (10wiki_willy) a:03Cmjohnson [19:27:08] (03PS5) 10Hashar: releases: inline php packages installation [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [19:27:42] 10Operations, 10ops-eqiad, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10wiki_willy) a:03Cmjohnson [19:27:55] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@df6322a]: Rename error field in deduplication logs (duration: 01m 28s) [19:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet 
- https://phabricator.wikimedia.org/T222109 (10wiki_willy) a:03RobH [19:29:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10RobH) a:05RobH→03Cmjohnson [19:30:09] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@fd0a41a]: Change the name of the error log field for deduplicatio [19:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:07] 10Operations, 10ops-eqiad, 10decommission: decommission thulium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T203520 (10wiki_willy) a:03Cmjohnson [19:31:09] (03PS6) 10Hashar: releases: inline php packages installation [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [19:31:14] (03PS3) 10Hashar: contint: remove php packages [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) [19:31:21] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [19:31:22] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@fd0a41a]: Change the name of the error log field for deduplicatio (duration: 01m 13s) [19:31:23] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [19:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:25] (03CR) 10Smalyshev: [C: 03+1] wdqs: introduced tuned journal options to wdqs2001. [puppet] - 10https://gerrit.wikimedia.org/r/523266 (owner: 10Gehel) [19:33:39] 10Operations, 10ops-esams, 10Traffic: cp3035 PS Redundancy Lost - https://phabricator.wikimedia.org/T225035 (10wiki_willy) Server will be refreshed in late Q1 / early Q2, along with a hardware refresh of the entire site. 
[19:36:55] 10Operations, 10ops-codfw, 10DBA: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10wiki_willy) [19:39:33] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10wiki_willy) [19:39:43] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10wiki_willy) a:03Papaul [19:40:35] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:41:05] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [19:41:11] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10wiki_willy) a:03Papaul [19:43:18] 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10wiki_willy) [19:45:21] 10Operations, 10ops-codfw: (OoW) lvs2002 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148017 (10wiki_willy) [19:45:22] (03CR) 10Smalyshev: [C: 03+1] "I thought you wanted to start with 2004 though?" [puppet] - 10https://gerrit.wikimedia.org/r/523266 (owner: 10Gehel) [19:46:44] (03PS1) 10Gehel: wdqs: introduced tuned journal options to wdqs2004. [puppet] - 10https://gerrit.wikimedia.org/r/523294 [19:48:47] (03CR) 10Smalyshev: [C: 03+1] wdqs: introduced tuned journal options to wdqs2004. [puppet] - 10https://gerrit.wikimedia.org/r/523294 (owner: 10Gehel) [19:49:07] (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs2004. 
[puppet] - 10https://gerrit.wikimedia.org/r/523294 [19:49:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:50:08] (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs2004. [puppet] - 10https://gerrit.wikimedia.org/r/523294 (owner: 10Gehel) [19:50:29] (03CR) 10Urbanecm: [C: 04-2] "Do not merge before Wednesday, July 17 / https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UploadWizard/+/523208 is in production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523214 (https://phabricator.wikimedia.org/T228073) (owner: 10Urbanecm) [19:50:45] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:51:53] (03PS3) 10Urbanecm: Revert "Delete Image-reviewer group from commonswiki for good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523214 (https://phabricator.wikimedia.org/T228098) [19:53:02] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:09] SMalyshev: ^^ [19:54:18] SMalyshev: processes to watch for: [19:54:21] https://www.irccloud.com/pastebin/GsXePDej/ [19:55:16] keep your fingers crossed! [19:55:30] * gehel is off for today, but scream if you need me! 
[19:59:49] (03PS1) 10MaxSem: Remove $wgPageTriageNoIndexTemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523295 [19:59:54] 10Operations, 10ops-eqsin: rack/setup/install ganeti500[123].eqsin.wmnet - https://phabricator.wikimedia.org/T228099 (10RobH) p:05Triage→03Normal [20:00:03] 10Operations, 10ops-eqsin: rack/setup/install ganeti500[123].eqsin.wmnet - https://phabricator.wikimedia.org/T228099 (10RobH) [20:00:04] cscott, arlolra, subbu, bearND, and halfak: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T2000). Please do the needful. [20:00:18] no parsoid deploy today [20:00:38] (03CR) 10MaxSem: [C: 04-2] "Waiting for the dependency to be live everywhere." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523295 (owner: 10MaxSem) [20:01:14] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10ayounsi) If we narrow it down to ms-fe* hosts they regularly spike between 5% and 15% which is a bit more worrying. https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=3&fullscr... [20:02:10] !log reducing consistency of db2045 to avoid lag at T227862 [20:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:18] T227862: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 [20:03:18] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10colewhite) We decided to drop logs from cpjobqueue and changeprop at the logstash layer with the following config: 89-filter_drop_cpjobque_changeprop.conf:... 
[20:06:28] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10bd808) Related to {T150106} if the root problem is type collisions in the Elasticsearch index [20:07:37] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10ayounsi) The same thing started to happen around the same time for labstore1007: https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=2&fullscreen&from=now-30d&to=now (temporary... [20:10:53] Hey folks. I can't ssh into bast1002.wikimedia.org. Is something going on? [20:11:14] It just hangs. Getting debug output... [20:11:44] debug1: Connecting to bast1002.wikimedia.org [2620:0:861:3:208:80:154:86] port 22. [20:11:45] PROBLEM - Disk space on mw1293 is CRITICAL: DISK CRITICAL - free space: /tmp 1288 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:11:48] Just hangs after that [20:12:42] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10jcrespo) Once the backlog is processed, https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=8&fullscreen&orgId=1 This can be lowe... [20:13:44] halfak: WFM [20:14:13] Thanks jynus. It's also not working for accraze but he is getting password prompt. [20:14:17] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team: rack/setup/install cloudmon100[123] - https://phabricator.wikimedia.org/T228102 (10RobH) p:05Triage→03Normal [20:14:30] mac? [20:14:37] halfak: works for me too, can you retry now? I'm tailing the logs [20:14:41] RECOVERY - Disk space on mw1293 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:14:50] Could be my university wifi. [20:15:09] Just did, volans. [20:15:15] I'm suspecting the university now though. 
[20:15:27] Might try tethering if you don't see anything in the logs. [20:15:39] maybe your ipv6 isn't properly routed or something [20:15:51] I see a bunch of connections from and then 'Did not receive identification string from ' $IP [20:16:27] OK I'll switch to tether and see if that does anything different. [20:20:05] I was able to get in but it's hanging for minutes at a time. [20:20:13] ^ on connection [20:30:12] !log deactivate HE peering in eqsin - T228015 [20:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:20] T228015: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 [20:33:21] volans, it looks like accraze is having a different problem -- a password prompt. Is there a good way for us to check if the public key we expect is in fact authorized? [20:34:21] halfak: sure, give me a sec [20:34:31] thanks. [20:35:00] halfak: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/admin/data/data.yaml#3205 [20:35:11] I can confirm that's on the host [20:36:20] (03PS6) 10Bstorm: toolforge: k8s: kubeadm: now using external etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [20:36:23] halfak: could he paste an "ssh -vv $HOSTNAME" somewhere? 
[20:37:11] ah, didn't realise accraze is in the channel actually :-) [20:37:28] ssh stat1006.eqiad.wmnet -vv [20:37:28] OpenSSH_7.9p1, LibreSSL 2.7.3 [20:37:28] debug1: Reading configuration data /Users/acraze/.ssh/config [20:37:30] debug1: /Users/acraze/.ssh/config line 1: Applying options for * [20:37:32] debug1: /Users/acraze/.ssh/config line 36: Applying options for *.wmnet [20:37:34] Woops [20:37:34] debug1: Reading configuration data /etc/ssh/ssh_config [20:37:36] debug1: /etc/ssh/ssh_config line 48: Applying options for * [20:37:38] debug1: /etc/ssh/ssh_config line 52: Applying options for * [20:37:40] debug1: Executing proxy command: exec ssh -W stat1006.eqiad.wmnet:22 bastion.wmf [20:37:42] debug1: identity file /Users/acraze/.ssh/id_rsa type 0 [20:37:44] debug1: identity file /Users/acraze/.ssh/id_rsa-cert type -1 [20:37:46] debug1: identity file /Users/acraze/.ssh/id_ed25519 type 3 [20:37:48] debug1: identity file /Users/acraze/.ssh/id_ed25519-cert type -1 [20:37:50] debug1: Local version string SSH-2.0-OpenSSH_7.9 [20:37:58] lol [20:38:40] OK well there it is. [20:38:44] bastion.wmf? [20:38:52] your bastion seems off [20:38:52] * halfak told accraze about the wonders of paste services. [20:39:08] that should be something like bast1002.wikimedia.org or bast4002.wikimedia.org [20:39:22] That's probably because he copied from my config. 
That points to bast1002.wikimedia.org [20:39:36] see https://wikitech.wikimedia.org/wiki/Production_shell_access#Setting_up_your_SSH_config [20:39:47] (03PS7) 10Hashar: releases: inline php packages installation [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) [20:40:02] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [20:40:14] FWIW, this *was* working for him last week :\ [20:40:39] (03CR) 10Hashar: [V: 03+1] "Noop on production host has expected: https://puppet-compiler.wmflabs.org/compiler1002/230/contint1001.wikimedia.org/ :]" [puppet] - 10https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [20:40:39] I just want you to know we're not asking you to help us set this up without trying the docs first :| [20:41:58] 10Operations, 10ops-codfw: (OoW) lvs2002 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148017 (10wiki_willy) a:03Papaul [20:42:47] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic: (OoW) lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082 (10wiki_willy) [20:42:58] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic: (OoW) lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082 (10wiki_willy) a:03Papaul [20:43:05] accraze: can you try connecting to the bastion itself, i.e. "ssh bast1002.wikimedia.org"? [20:43:11] (03CR) 10Hashar: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/231/releases1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [20:43:23] moritzm, my thought too. We're on it. [20:43:53] ack :-) [20:44:00] Aha! 
We ran into a "too open of permissions" error when going directly to bast1002 [20:44:08] Fixing perms on the private key [20:44:28] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@bc3a2fd]: Update mobileapps to 7fd39da (T227907) [20:44:29] Weird that we both had a problem at the same time. But it seems they were unrelated. [20:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:34] T227907: [Bug] mobile-html: DOMRect object from InteractionHandler response cannot be decoded on iOS - https://phabricator.wikimedia.org/T227907 [20:44:35] New paste incoming. [20:44:37] input_userauth_request: invalid user acraze [preauth] [20:44:37] 10Operations, 10ops-codfw: mc2023 / mc2025 fail to mount root partition within 90 seconds using Linux 4.9 - https://phabricator.wikimedia.org/T170152 (10wiki_willy) a:03Papaul [20:44:52] Aha! [20:45:08] How did all of this change? Why did it work last week? /me puts head in sand. [20:45:36] are you sending the correct key? maybe mixed with the development one for gerrit/wmcs? [20:45:46] ah, acraze [20:46:00] missing a 'c' [20:46:06] Yeah. We checked that. I think it was ultimately the perms on the private key [20:46:17] But it only gave that error when we were not proxying. [20:46:18] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10wiki_willy) [20:46:31] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10wiki_willy) a:03Papaul [20:46:40] cool, enjoy stat1006, then :-) [20:46:55] Thanks! 
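Note on the SSH thread above: the "too open of permissions" error was cleared by tightening the mode on the private key. A minimal sketch of that check follows, assuming a POSIX system and assuming the client refuses a key file that has any group or other permission bits set. The function names `key_permissions_ok` and `restrict_to_owner` are hypothetical helpers, not part of any tool mentioned in the log, and the demo runs on a throwaway temporary file rather than a real key.

```python
import os
import stat
import tempfile


def key_permissions_ok(path: str) -> bool:
    """Return True if the file has no group/other permission bits set.

    Assumption: this mirrors the kind of check an SSH client applies to a
    private key before using it; it is an illustration, not the client's code.
    """
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict_to_owner(path: str) -> None:
    """Set the file to owner read/write only (the equivalent of chmod 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


if __name__ == "__main__":
    # Demonstrate on a temporary file standing in for a private key.
    fd, demo_key = tempfile.mkstemp()
    os.close(fd)
    try:
        os.chmod(demo_key, 0o644)  # simulate a key that is too open
        print("before:", key_permissions_ok(demo_key))
        restrict_to_owner(demo_key)
        print("after:", key_permissions_ok(demo_key))
    finally:
        os.remove(demo_key)
```

As the thread notes, the permissions error only surfaced when connecting to the bastion directly, so testing the bastion hop on its own is a useful first step. The separate "invalid user acraze" message points at a username mismatch, which this sketch does not cover.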
[20:47:31] (03CR) 10Bstorm: [C: 03+2] toolforge: k8s: kubeadm: now using external etcd servers [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [20:47:35] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10wiki_willy) [20:47:39] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10wiki_willy) a:03Papaul [20:47:49] !log add `as-path HE ".* 6939 .*"` to AVOID-PATH in eqsin - T228015 [20:47:55] * Krinkle is staging https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/522608/ on mwdebug1002 for deploy soon [20:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:56] T228015: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 [20:50:19] !log deploy1001: Unable to fetch git commits from Gerrit for php-1.34.0-wmf.13 due to "error: cannot update the ref 'refs/remotes/origin/fundraising/REL1_31': unable to append to '.git/logs/refs/remotes/origin/fundraising/REL1_31': Permission denied" [20:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:39] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [20:51:08] 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10wiki_willy) 05Open→03Resolved a:03Papaul Looks like things are resolved here, so I'm going to resolve the task, but feel free to reopen if there's still something that needs to be comp... 
[20:52:13] 10Operations, 10ops-codfw: (OoW) rdb2002 correctable memory errors - https://phabricator.wikimedia.org/T194171 (10wiki_willy) [20:52:21] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@bc3a2fd]: Update mobileapps to 7fd39da (T227907) (duration: 07m 53s) [20:52:25] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@bc3a2fd]: Update mobileapps to 7fd39da (T227907) [20:52:25] 10Operations, 10ops-codfw: (OoW) rdb2002 correctable memory errors - https://phabricator.wikimedia.org/T194171 (10wiki_willy) a:03Papaul [20:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:28] T227907: [Bug] mobile-html: DOMRect object from InteractionHandler response cannot be decoded on iOS - https://phabricator.wikimedia.org/T227907 [20:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:50] 10Operations, 10ops-codfw: (OoW) wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174 (10wiki_willy) [20:54:02] 10Operations, 10ops-codfw: (OoW) wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174 (10wiki_willy) a:03Papaul [20:54:49] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@bc3a2fd]: Update mobileapps to 7fd39da (T227907) (duration: 02m 24s) [20:54:55] 10Operations, 10ops-codfw: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10wiki_willy) [20:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:02] 10Operations, 10ops-codfw: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10wiki_willy) a:03Papaul [20:55:48] 10Operations, 10ops-codfw: (OoW) wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (10wiki_willy) [20:56:01] 10Operations, 10ops-codfw: (OoW) wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (10wiki_willy) a:03Papaul [20:57:54] 
10Operations, 10ops-codfw: (OoW) wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10wiki_willy) [20:58:04] 10Operations, 10ops-codfw: (OoW) wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10wiki_willy) a:03Papaul [20:58:09] (03PS1) 10Jforrester: tests/Defines.php: Re-synchronise from MW core master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523308 [20:59:00] (03CR) 10jerkins-bot: [V: 04-1] tests/Defines.php: Re-synchronise from MW core master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523308 (owner: 10Jforrester) [20:59:41] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.13/includes/Title.php: T227700 / T227700: getSubpage should not lose the interwiki prefix (duration: 00m 52s) [20:59:57] 10Operations, 10netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (10ayounsi) Seems like HE in eqsin is having a bad time. I depref all AS paths that go through HE and packet loss stopped. Emailed HE's NOC. [21:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy and sbassett: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T2100). [21:00:04] T227700: Fatal on some Special:MyLanguage urls: MWException "Can't determine talk page associated with interwiki link" - https://phabricator.wikimedia.org/T227700 [21:01:32] 10Operations, 10netops: mr1-eqsin.oob IPv6 connectivity flapping - https://phabricator.wikimedia.org/T227967 (10ayounsi) > So far I don't think there is a link between the ripe alerts and the oob alerts. Well, seems like they are, as the return path from mr1 -> icinga1001 goes through HE, nothing we can do th... 
[21:02:25] 10Operations, 10decommission: Decommission old server wmf4077 - https://phabricator.wikimedia.org/T190086 (10wiki_willy) a:03Cmjohnson [21:05:31] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 32 probes of 437 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:06:25] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:06:25] RECOVERY - MariaDB Slave Lag: x1 on db2045 is OK: OK slave_sql_lag Replication lag: 19.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:06:44] !log rollback `as-path HE ".* 6939 .*"` to AVOID-PATH in eqsin - T228015 [21:06:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:50] T228015: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 [21:09:10] 10Operations, 10netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (10ayounsi) 05Open→03Resolved a:03ayounsi They were very quick to reply and fix the issue. >RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK [21:12:21] 10Operations, 10netops: mr1-eqsin.oob IPv6 connectivity flapping - https://phabricator.wikimedia.org/T227967 (10ayounsi) 05Open→03Resolved Seems like fixing T228015 fixed that issue as well. 
[21:12:26] 10Operations, 10ops-eqiad, 10cloud-services-team, 10procurement: RAID Battery Failure on cloudvirt1006 (HP DL380p Gen8) - https://phabricator.wikimedia.org/T228105 (10wiki_willy) [21:14:45] PROBLEM - Disk space on mw1293 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [21:16:12] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:16:13] RECOVERY - Disk space on mw1293 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [21:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:29] PROBLEM - High lag on wdqs1010 is CRITICAL: 5007 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:17:13] PROBLEM - High lag on wdqs2004 is CRITICAL: 5052 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:17:21] SMalyshev: ^^^ [21:18:09] gehel: so it's done? great [21:18:13] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 5096 ge 3600 Gehel catching up after data reload https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:18:13] ACKNOWLEDGEMENT - High lag on wdqs2004 is CRITICAL: 5052 ge 3600 Gehel catching up after data reload https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:18:16] I'll watch it [21:18:33] SMalyshev: thanks! 
[21:30:35] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 268.3 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [21:39:54] Hey all - going to scap out a /private change now (unless I shouldn't...) [21:46:30] !log sbassett@deploy1001 Synchronized private/PrivateSettings.php: Add more severe rate limits for eswikiquote (T227416) (duration: 00m 50s) [21:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:11] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1092 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:50:37] (03PS3) 10Jforrester: Introduce wmgEnableJsonConfigDataMode so we can scrap wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522530 [21:50:41] (03PS2) 10Tarrow: Bump Termbox Staging to 2019-07-12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/523124 [21:51:06] jouncebot: now [21:51:06] For the next 1 hour(s) and 8 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T2100) [21:51:41] (03PS3) 10Jforrester: Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522531 [21:51:52] (03PS3) 10Jforrester: Drop wmgEnableTabularData and wmgEnableMapData, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522532 [21:51:54] (03CR) 10Tarrow: "I do have +2 here; I was going to wait for someone from the termbox team to +2. Or is this a de-facto self +2 ok place?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/523124 (owner: 10Tarrow) [21:52:05] (03PS3) 10Jforrester: Stop setting wgNonincludableNamespaces to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522535 [21:52:33] (03CR) 10Jforrester: [C: 03+2] Introduce wmgEnableJsonConfigDataMode so we can scrap wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522530 (owner: 10Jforrester) [21:53:36] (03Merged) 10jenkins-bot: Introduce wmgEnableJsonConfigDataMode so we can scrap wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522530 (owner: 10Jforrester) [21:54:14] (03CR) 10Jforrester: [C: 03+2] Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522531 (owner: 10Jforrester) [21:55:01] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add wmgEnableJsonConfigDataMode to IS (duration: 00m 55s) [21:55:08] (03Merged) 10jenkins-bot: Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522531 (owner: 10Jforrester) [21:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:24] (03CR) 10Jforrester: [C: 03+2] Drop wmgEnableTabularData and wmgEnableMapData, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522532 (owner: 10Jforrester) [21:55:58] !log Depool mw1239 for maintenance - T227867 [21:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:06] T227867: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 [21:56:20] (03Merged) 10jenkins-bot: Drop wmgEnableTabularData and wmgEnableMapData, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522532 (owner: 10Jforrester) [21:56:23] (03CR) 10jenkins-bot: Introduce wmgEnableJsonConfigDataMode so we can scrap 
wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522530 (owner: 10Jforrester) [21:56:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10jijiki) a:05jijiki→03Cmjohnson [21:57:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10jijiki) Thank you! [21:57:33] (03CR) 10Krinkle: "I think such a setting is worth keeping as explicit even if it matches the default." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522536 (owner: 10Jforrester) [21:58:16] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData (duration: 00m 56s) [21:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:32] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522536 (owner: 10Jforrester) [21:58:36] (03CR) 10Jforrester: [C: 03+2] Stop setting wgNonincludableNamespaces to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522535 (owner: 10Jforrester) [21:59:39] (03Merged) 10jenkins-bot: Stop setting wgNonincludableNamespaces to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522535 (owner: 10Jforrester) [22:00:11] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Drop wmgEnableTabularData and wmgEnableMapData, unused (duration: 00m 55s) [22:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:11] RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1190 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:01:34] (03CR) 10Jforrester: [C: 04-1] "For Thursday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/522536 (owner: 10Jforrester) [22:01:38] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgNonincludableNamespaces, default, never varied (duration: 00m 52s) [22:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:40] (03PS1) 10Bstorm: toolforge: kubeadm master nodes shouldn't use client certs for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523328 (https://phabricator.wikimedia.org/T215531) [22:03:46] (03PS6) 10Jforrester: Even more invariant config moved over to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 [22:04:13] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:04:51] (03PS2) 10Jforrester: tests/Defines.php: Re-synchronise from MW core master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523308 [22:06:18] (03CR) 10Jforrester: [C: 03+2] tests/Defines.php: Re-synchronise from MW core master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523308 (owner: 10Jforrester) [22:07:08] (03Merged) 10jenkins-bot: tests/Defines.php: Re-synchronise from MW core master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523308 (owner: 10Jforrester) [22:14:01] 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10ayounsi) p:05Triage→03Normal [22:14:15] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10ayounsi) [22:14:17] 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10ayounsi) [22:15:45] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10ayounsi) [22:15:47] 
10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10ayounsi) [22:19:42] 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10ayounsi) [22:20:35] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:21:57] (03PS2) 10Catrope: GrowthExperiments: Remove reference to non-existent feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523233 (owner: 10Kosta Harlan) [22:24:07] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:28:01] (03PS1) 10BryanDavis: striker: Update package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/523335 [22:33:08] (03CR) 10Catrope: "> Patch Set 4:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [22:34:21] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:34:42] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops-radar, 10Core Platform Team (Mainstash Multi-DC), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Krinkle) [22:37:04] 10Operations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10ayounsi) > Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. 
The reason for this is that I hadn't... [22:43:42] (03CR) 10Jhedden: [C: 03+2] striker: Update package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/523335 (owner: 10BryanDavis) [22:43:59] PROBLEM - Disk space on mw1293 is CRITICAL: DISK CRITICAL - free space: /tmp 516 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [22:52:45] RECOVERY - Disk space on mw1293 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [23:00:04] MaxSem, RoanKattouw, and Niharika: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190715T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:20] Also me.. looks like I put the deployment request in the wrong place again.. fixing not [23:00:26] * fixing now [23:01:07] done @RoanKattouw [23:03:49] I see it, +2ed. 
Will deploy mine first while we wait for yours to go through CI [23:04:12] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Remove reference to non-existent feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523233 (owner: 10Kosta Harlan) [23:05:20] (03Merged) 10jenkins-bot: GrowthExperiments: Remove reference to non-existent feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523233 (owner: 10Kosta Harlan) [23:06:08] thanks RoanKattouw [23:07:11] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Remove reference to non-existent feature flag (duration: 00m 51s) [23:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:48] 10Operations, 10ops-codfw, 10decommission: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10RobH) [23:18:38] RoanKattouw: PHAN issues [23:18:50] Ugh looking [23:19:04] what to do? https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/523260/ doesn't seem related [23:19:25] 16:05:30 Package mediawiki/phan-taint-check-plugin at version 1.5.0 has a PHP requirement incompatible with your PHP version (7.2.16) [23:19:26] Guys...
[23:19:29] James_F: ---^^ [23:19:43] (in wmf.13) [23:20:26] I think this is because phan was updated this week and phan jobs for PHP 7.2 enabled, but the update may not have hit wmf.13 [23:21:22] It would be unfortunate if this broke phan for all wmf.13 cherry-picks [23:25:08] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10Krinkle) [23:26:12] (03PS1) 10RobH: decom db2042 [puppet] - 10https://gerrit.wikimedia.org/r/523350 (https://phabricator.wikimedia.org/T225090) [23:27:12] (03PS1) 10RobH: decom db2042 [dns] - 10https://gerrit.wikimedia.org/r/523353 (https://phabricator.wikimedia.org/T225090) [23:27:23] (03CR) 10RobH: [C: 03+2] decom db2042 [puppet] - 10https://gerrit.wikimedia.org/r/523350 (https://phabricator.wikimedia.org/T225090) (owner: 10RobH) [23:27:27] jdlrobson: OK I think I figured it out, running Jenkins on it again now [23:29:12] (03CR) 10RobH: [C: 03+2] decom db2042 [dns] - 10https://gerrit.wikimedia.org/r/523353 (https://phabricator.wikimedia.org/T225090) (owner: 10RobH) [23:30:13] thanks RoanKattouw [23:31:07] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [23:31:12] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:18] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db2042.codfw.wmnet` - db2042.codfw.wmnet - Removed from Puppet master and PuppetDB... 
[23:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:47] 10Operations, 10ops-codfw, 10decommission: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10RobH) [23:35:00] 10Operations, 10ops-codfw, 10decommission: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10RobH) a:05RobH→03Papaul [23:36:18] (03PS1) 10Catrope: Enable GrowthExperiments help panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523354 (https://phabricator.wikimedia.org/T226729) [23:36:21] (03PS1) 10Catrope: Enable help panel for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523355 (https://phabricator.wikimedia.org/T226729) [23:38:45] RoanKattouw: still not merging.. :/ [23:40:06] jdlrobson: That looks like an intermittent/flaky npm error, but also it was the on-submit Jenkins run, not the pre-merge one [23:40:33] That one is still running and almost done, see gate-and-submit-swat at https://integration.wikimedia.org/zuul/ [23:42:29] 10Operations, 10Traffic, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10CDanis) I think we likely want to revisit this. * Right now the `guest` user has access to `/eventlogging` which I don't think we actual... [23:47:00] Aaargh but I removed my +2 accidentally because I got confused, so now I have to rerun it [23:48:13] 10Operations, 10ops-eqiad, 10cloud-services-team: (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10wiki_willy) subtask opened up with procurement to order raid battery. ~willy [23:48:39] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10CDanis) I've started an incident document at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190715-logstash and would appreciate more contribut... 
[23:49:45] 10Operations, 10Wikimedia-Logstash, 10observability, 10Wikimedia-Incident: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 (10CDanis) 05Open→03Resolved a:03CDanis The backlog in Kafka should clear in just a few more minutes. Closing this; separate issues to be opened later fo... [23:55:22] PROBLEM - Disk space on mw1296 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [23:55:27] !log rotate network-root password [23:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:40] (03PS2) 10Catrope: Enable GrowthExperiments help panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523354 (https://phabricator.wikimedia.org/T226729) [23:55:42] (03PS2) 10Catrope: Enable help panel for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523355 (https://phabricator.wikimedia.org/T226729) [23:56:30] RECOVERY - Disk space on mw1296 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [23:57:06] (03PS1) 10Catrope: Enable GrowthExperiments homepage on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120) [23:57:08] (03PS1) 10Catrope: Enable homepage for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) [23:58:12] (03CR) 10Cwhite: [C: 03+1] netbox : Add Hiera data for automatic LibreNMS Netbox report [puppet] - 10https://gerrit.wikimedia.org/r/522562 (owner: 10CRusnov) [23:59:29] jouncebot: next [23:59:29] In 11 hour(s) and 0 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190716T1100)