[00:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T0000). [00:00:26] 10Operations, 10netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10ayounsi) > I have checked the core and the information, we did not find any PR related to this, please give us a few days to analyze the core. [00:10:09] (03CR) 10Alex Monk: [C: 04-1] acme_chief: Prevalidate CN/SNI list (037 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [00:23:31] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10faidon) [00:25:06] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10faidon) Let's add Willy to the group `datacenter-ops`. I don't think he needs to necessarily be in the group `ops` (which is really a misnomer at this point), for now. [00:39:18] 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Krenair) >>! In T133548#5121135, @Krinkle wrote: > @Dzahn Assuming that with Let's Encrypt, HTTPS will work in mode... 
[00:40:04] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10wiki_willy) [00:54:44] (03PS1) 10Alex Monk: git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) [00:57:13] 10Puppet, 10Patch-For-Review, 10cloud-services-team (Kanban): Have puppet-merge on puppetmaster1001 publish the official sha1 after merging - https://phabricator.wikimedia.org/T219390 (10Krenair) I think for this to make sense we should require labs/private repository to also exist on prod puppetmasters and... [01:04:25] PROBLEM - Device not healthy -SMART- on db2047 is CRITICAL: cluster=mysql device=cciss,11 instance=db2047:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2047&var-datasource=codfw+prometheus/ops [03:32:09] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:58:45] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [04:05:13] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:14:45] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:14:47] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:18:27] 10Operations, 10cloud-services-team: labpuppetmaster logs 'cannot collect exported resources without storeconfigs being set' - https://phabricator.wikimedia.org/T221115 (10bd808) I'm pretty sure this is expected. We don't have puppetdb or another exported resource collector system on this puppetmaster. Doing s... 
[04:31:47] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:41:21] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [04:46:41] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [05:17:05] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:35:58] (03CR) 10Santhosh: [C: 03+1] Remove ExternalGuidanceEnableContextDetection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819) (owner: 10KartikMistry) [05:43:45] RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:37:11] !log rolling reboots of Swift backends in eqiad for combined kernel/glibc/OpenSSL update [06:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:33] 10Operations, 10Traffic, 10Patch-For-Review: Removal of If-Cached VCL support - https://phabricator.wikimedia.org/T220510 (10ema) 05Open→03Resolved [06:58:47] !log restarting icinga on icinga1001 (T196336) [06:58:48] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: add python3-venv [puppet] - 10https://gerrit.wikimedia.org/r/504829 [06:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:53] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [07:01:58] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15865/" [puppet] - 10https://gerrit.wikimedia.org/r/504829 (owner: 10Elukey) [07:19:25] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [07:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:03] !log installing libssh2 
security updates [07:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:23] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) 05Open→03Stalled I had a conv... [07:32:38] (03PS1) 10Muehlenhoff: Add library hint for libssh2 [puppet] - 10https://gerrit.wikimedia.org/r/504830 [07:34:31] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libssh2 [puppet] - 10https://gerrit.wikimedia.org/r/504830 (owner: 10Muehlenhoff) [07:41:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504590 (https://phabricator.wikimedia.org/T221143) (owner: 10Herron) [07:53:03] (03PS1) 10Elukey: profile::mediawiki::nutcracker: make memcached configuration optional [puppet] - 10https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) [07:53:37] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2004.codfw.wmnet [07:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:46] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:28] (03CR) 10Elukey: "No op: https://puppet-compiler.wmflabs.org/compiler1002/15866/" [puppet] - 10https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [07:59:08] 10Operations, 10monitoring, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) Both prometheus1004 and prometheus2004 are now in 
service with Prometheus v2! So far no issues, syncing the whole storage from their counterparts took... [07:59:30] (03CR) 10Elukey: "The idea would be to remove mediawiki_memcached_servers from deployment-prep's hiera config and let it bake for a bit, before proceeding t" [puppet] - 10https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [08:01:33] !log restarting mw1261-mw1265 to pick up new libssh2 [08:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:51] (03CR) 10Elukey: "A broader set of No ops: https://puppet-compiler.wmflabs.org/compiler1002/15869/" [puppet] - 10https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [08:14:36] !log installing libssh2 security updates on jessie [08:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:40] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [08:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:16] PROBLEM - High lag on wdqs1009 is CRITICAL: 3893 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:27:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "overall LGTM, a few comments that should be addressed." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [08:29:12] ACKNOWLEDGEMENT - High lag on wdqs1009 is CRITICAL: 3801 ge 3600 Gehel catching up after data transfer - https://phabricator.wikimedia.org/T220830 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:29:12] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 3916 ge 3600 Gehel catching up after data transfer - https://phabricator.wikimedia.org/T220830 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:31:34] (03CR) 10Elukey: "Would you guys mind to do another round of reviews? Just to be sure that the latest version is ok for everybody :)" [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey) [08:33:30] ^ elukey having a look [08:35:44] thanks! [08:42:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM. All my comments are just really nitpicks." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [08:45:34] (03CR) 10Muehlenhoff: [C: 03+1] "One nit in the comments, but feel free to merge" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey) [08:48:20] (03CR) 10Vgutierrez: [C: 03+2] tlsproxy: Ensure OCSP stapling nginx reload hook present for acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/504571 (https://phabricator.wikimedia.org/T221171) (owner: 10Alex Monk) [08:48:28] (03PS3) 10Vgutierrez: tlsproxy: Ensure OCSP stapling nginx reload hook present for acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/504571 (https://phabricator.wikimedia.org/T221171) (owner: 10Alex Monk) [08:53:20] !log reboot kafka10[12-23] (old Analytics cluster) for kernel + openjdk upgrades [08:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:35] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 - https://phabricator.wikimedia.org/T221343 (10Vgutierrez) [08:54:43] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [08:54:43] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [08:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [08:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:56] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 - https://phabricator.wikimedia.org/T221343 (10Vgutierrez) so... 
this is caused by my locales: `vgutierrez@cp1008:~$ unset LC_CTYPE vgutierrez@cp1008:~$ sudo -i puppet agent -t Warning: Support for ruby version 2.1.5 is deprecated and will be... [08:55:59] elukey: if you pass -t/--task-id it will now also update the task ;) [08:56:08] up to you if you want the additional spam or not :D [09:00:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:00:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:23] volans: ah yes I forgot! [09:02:18] PROBLEM - Host ms-be1013 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:29] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 under certain conditions - https://phabricator.wikimedia.org/T221343 (10Vgutierrez) p:05Triage→03Low [09:02:39] 1013 is known, silence expired [09:02:45] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 under certain conditions - https://phabricator.wikimedia.org/T221343 (10Vgutierrez) for the record, LC_CTYPE=UTF-8 [09:03:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] cache: add profile::cache::varnish::frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [09:04:00] volans: there's no option for sre.hosts.downtime to not log to IRC or did I miss it? [09:09:05] (03CR) 10Vgutierrez: [C: 03+1] "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1002/15870/" [puppet] - 10https://gerrit.wikimedia.org/r/501890 (owner: 10Alex Monk) [09:09:25] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 under certain conditions - https://phabricator.wikimedia.org/T221343 (10MoritzMuehlenhoff) >>!
In T221343#5121922, @Vgutierrez wrote: > but this was working as expected before For completeness/context: Previously cp1008 was using facter 2.4.... [09:10:55] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:12] (03CR) 10DCausse: "> This should be updated to match the latest change at https://gerrit-review.googlesource.com/c/gerrit/+/220263 (same for the 2.16 change)" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/502487 (owner: 10DCausse) [09:12:32] volans: one thing that might be good would be having the reason of the downtime logged by the bot [09:12:47] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Add security::access::config on passive host in cloud [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [09:12:58] (03PS6) 10Vgutierrez: acme_chief: Add security::access::config on passive host in cloud [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [09:18:03] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830 (10Gehel) Data transfer completed with the new cookbook, everything seems fine. 
[09:21:21] (03CR) 10Vgutierrez: [C: 03+2] role::swift::proxy: Move TLS stuff out into profile [puppet] - 10https://gerrit.wikimedia.org/r/501890 (owner: 10Alex Monk) [09:21:30] (03PS2) 10Vgutierrez: role::swift::proxy: Move TLS stuff out into profile [puppet] - 10https://gerrit.wikimedia.org/r/501890 (owner: 10Alex Monk) [09:21:52] (03PS1) 10Elukey: admin: add the analytics system user to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) [09:23:36] elukey: indeed, see T221212#5118068 [09:23:37] T221212: spicerack/cookbook: add additional arguments IRC/SAL logging - https://phabricator.wikimedia.org/T221212 [09:23:53] for the explanation why is not yet ;) [09:27:19] (03PS14) 10Vgutierrez: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk) [09:28:18] 10Operations, 10Puppet, 10Documentation, 10Patch-For-Review, 10patch-welcome: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10Joe) It could be useful to try to enforce good practices for new code. One way I could think of is running some puppet-lint plugins l... 
[09:29:49] (03PS1) 10Gilles: Run CPU benchmark for all samples on eswiki/ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504841 (https://phabricator.wikimedia.org/T216597) [09:30:24] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1131 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:30:59] (03CR) 10jerkins-bot: [V: 04-1] Run CPU benchmark for all samples on eswiki/ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504841 (https://phabricator.wikimedia.org/T216597) (owner: 10Gilles) [09:31:36] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 717.1 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:32:05] (03PS2) 10Gilles: Run CPU benchmark for all samples on eswiki/ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504841 (https://phabricator.wikimedia.org/T216597) [09:34:35] (03CR) 10Jbond: [C: 03+1] "minor nitpicks but not blocking" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey) [09:35:51] (03CR) 10Gilles: [C: 03+2] Run CPU benchmark for all samples on eswiki/ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504841 (https://phabricator.wikimedia.org/T216597) (owner: 10Gilles) [09:36:56] (03Merged) 10jenkins-bot: Run CPU benchmark for all samples on eswiki/ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504841 (https://phabricator.wikimedia.org/T216597) (owner: 10Gilles) [09:38:01] (03PS1) 10ArielGlenn: for abstract dumps, skip any processing of pages not in main namespace [dumps] - 10https://gerrit.wikimedia.org/r/504842 (https://phabricator.wikimedia.org/T220940) [09:38:23] (03CR) 10jenkins-bot: Run CPU benchmark for all samples on eswiki/ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504841 (https://phabricator.wikimedia.org/T216597) (owner: 10Gilles) [09:40:50] (03PS3) 10Volans: icinga: 
generate config for Icinga meta-monitoring [puppet] - 10https://gerrit.wikimedia.org/r/503945 [09:41:10] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216597 Run CPU benchmark for all samples on eswiki/ruwiki (duration: 01m 06s) [09:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:16] T216597: Event Timing origin trial - https://phabricator.wikimedia.org/T216597 [09:42:16] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10akosiaris) Just noting that at 10:41 UTC the circuit was still down per ` akosiaris@cr3-ulsfo> show interfaces descriptions | match 313592 xe-0/1/1 up down Transport: cr2-eqord:xe-0/1/1... [09:43:36] (03CR) 10Volans: [C: 03+2] "Compiler looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/503945 (owner: 10Volans) [09:44:02] PROBLEM - PHP7 rendering on mw1274 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 919 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:44:03] !log update grafana service/ dashboard to have user, system, throttled CPU metrics under the CPU saturation row [09:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:29] (03PS1) 10Vgutierrez: add fake secrets for mcrouter to unbreak puppet compile on mw2151.codfw.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/504843 [09:47:11] (03PS3) 10Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822) [09:47:13] (03PS1) 10Alexandros Kosiaris: kubelet: Remove noop CADVISOR_PORT comment [puppet] - 10https://gerrit.wikimedia.org/r/504844 [09:47:15] (03PS1) 10Alexandros Kosiaris: Fix typo with schema2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504845 (https://phabricator.wikimedia.org/T219556) [09:47:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix typo with 
schema2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504845 (https://phabricator.wikimedia.org/T219556) (owner: 10Alexandros Kosiaris) [09:47:42] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] add fake secrets for mcrouter to unbreak puppet compile on mw2151.codfw.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/504843 (owner: 10Vgutierrez) [09:48:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubelet: Remove noop CADVISOR_PORT comment [puppet] - 10https://gerrit.wikimedia.org/r/504844 (owner: 10Alexandros Kosiaris) [09:48:58] (03PS2) 10Mathew.onipe: Add maps postgres init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [09:50:03] (03PS2) 10Alexandros Kosiaris: Fix typo with schema2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504845 (https://phabricator.wikimedia.org/T219556) [09:50:08] onimisionipe: is this WIP or ready to review? [09:50:19] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix typo with schema2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504845 (https://phabricator.wikimedia.org/T219556) (owner: 10Alexandros Kosiaris) [09:50:26] volans: It's ready for review [09:50:48] (03PS3) 10Filippo Giunchedi: aptrepo: validate deb822 files [puppet] - 10https://gerrit.wikimedia.org/r/503025 [09:50:55] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: validate deb822 files [puppet] - 10https://gerrit.wikimedia.org/r/503025 (owner: 10Filippo Giunchedi) [09:52:49] (03CR) 10Vgutierrez: [C: 03+1] "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1002/15873/" [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk) [09:56:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10akosiaris) >>! In T219556#5120409, @Ottomata wrote: > ganeti1001, 1002 and 2001 have been installed. I dunno what's up with ganet...
[09:58:45] (03PS29) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [09:58:47] (03PS20) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [10:01:54] (03CR) 10Filippo Giunchedi: [C: 03+1] deployment-prep: Add stretch storage hosts [software/swift-ring] - 10https://gerrit.wikimedia.org/r/503714 (owner: 10Alex Monk) [10:02:32] (03CR) 10Filippo Giunchedi: [C: 03+1] Fix broken profile::swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/503707 (https://phabricator.wikimedia.org/T220895) (owner: 10Alex Monk) [10:03:14] Krenair: I'll merge your swift/deployment-prep patches, thanks for working on that! [10:03:46] (03CR) 10Filippo Giunchedi: [C: 03+2] Fix broken profile::swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/503707 (https://phabricator.wikimedia.org/T220895) (owner: 10Alex Monk) [10:03:53] (03PS2) 10Filippo Giunchedi: Fix broken profile::swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/503707 (https://phabricator.wikimedia.org/T220895) (owner: 10Alex Monk) [10:05:01] (03CR) 10Vgutierrez: [C: 04-1] "looks good, please fix the script permissions" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [10:05:05] (03CR) 10Filippo Giunchedi: [C: 03+1] deployment-prep: Update with live files [software/swift-ring] - 10https://gerrit.wikimedia.org/r/503544 (owner: 10Alex Monk) [10:05:36] Krenair: and https://gerrit.wikimedia.org/r/c/operations/software/swift-ring/+/503544 ok to merge before https://gerrit.wikimedia.org/r/c/operations/software/swift-ring/+/503714 but merge both? 
[10:05:53] (03CR) 10Ladsgroup: [C: 04-1] "This should be merged in a week from now" [puppet] - 10https://gerrit.wikimedia.org/r/504776 (owner: 10Ladsgroup) [10:07:15] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: don't require Prometheus::Server when writing k8s token [puppet] - 10https://gerrit.wikimedia.org/r/490834 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [10:07:23] (03PS5) 10Filippo Giunchedi: prometheus: don't require Prometheus::Server when writing k8s token [puppet] - 10https://gerrit.wikimedia.org/r/490834 (https://phabricator.wikimedia.org/T187987) [10:07:43] (03Abandoned) 10Filippo Giunchedi: WIP: define haproxy service for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [10:17:24] (03CR) 10Ema: cache: add profile::cache::varnish::frontend (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [10:18:17] 10Operations, 10serviceops: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T221346 (10Joe) [10:18:53] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 under certain conditions - https://phabricator.wikimedia.org/T221343 (10jbond) I had a quick look at this and was unable to recreate it, i did come across the following though and wonder if the work around there may work https://stackoverflo... [10:20:04] 10Operations, 10serviceops, 10User-Elukey: Renew certs for mcrouter on all application servers. 
- https://phabricator.wikimedia.org/T221346 (10elukey) [10:21:21] !log installing jasper updates on jessie hosts [10:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:48] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 under certain conditions - https://phabricator.wikimedia.org/T221343 (10jbond) I have noticed you can trigger this bug by using a locale not present on the server ` for loc in $(locale -a) en_GB.utf8 ; do echo ${loc}; sudo env LC_ALL=${loc... [10:32:39] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Documentation, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10hashar) [10:34:19] (03PS6) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [10:35:03] !log installing rails security updates on jessie hosts [10:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:57] (03CR) 10Volans: "Some comments inline, I might miss some context" (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [10:40:30] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:51] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [10:43:23] (03PS2) 10Alexandros Kosiaris: kubelet: Remove noop CADVISOR_PORT comment [puppet] - 10https://gerrit.wikimedia.org/r/504844 [10:43:25] (03PS4) 10Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822) [10:43:27] (03PS1) 10Alexandros Kosiaris: Support node labels and taints in kubelet [puppet] - 10https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821) [10:43:29] (03PS1) 10Alexandros Kosiaris: Add failure-domain node labels for kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/504850 [10:43:31] (03PS1) 10Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs [puppet] - 10https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822) [10:52:22] (03CR) 10Volans: [C: 03+1] "LGTM, I've put some real nitpick and optional comments inline and a couple of questions." (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [10:52:46] (03PS2) 10Alexandros Kosiaris: Support node labels and taints in kubelet [puppet] - 10https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821) [10:52:48] (03PS2) 10Alexandros Kosiaris: Add failure-domain node labels for kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/504850 [10:52:50] (03PS5) 10Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822) [10:52:52] (03PS2) 10Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs [puppet] - 10https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822) [10:57:40] (03PS7) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [10:58:07] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: 
Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [10:58:39] (03PS3) 10Alexandros Kosiaris: Support node labels and taints in kubelet [puppet] - 10https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821) [10:58:41] (03PS3) 10Alexandros Kosiaris: Add failure-domain node labels for kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/504850 [10:58:43] (03PS6) 10Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822) [10:58:45] (03PS3) 10Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs [puppet] - 10https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822) [10:59:05] (03PS8) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [10:59:22] I wonder if anyone here can actually see if there are queued mass messages? https://meta.wikimedia.org/wiki/Special:Log/massmessage didn't give me errors but yet again, my messages weren't delivered :( [11:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:21] (03PS2) 10ArielGlenn: for abstract dumps, skip any processing of pages not in main namespace [dumps] - 10https://gerrit.wikimedia.org/r/504842 (https://phabricator.wikimedia.org/T220940) [11:05:12] (03PS9) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [11:05:33] PROBLEM - puppet last run on kubestagetcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:13] PROBLEM - Apache HTTP on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:07:31] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:07:53] PROBLEM - Host kafka1023 is DOWN: PING CRITICAL - Packet loss = 100% [11:08:04] 10Operations, 10Growth-Team, 10Notifications: Discussion: Explore push notifications options - https://phabricator.wikimedia.org/T221265 (10Volans) I agree that it would be nice to have push notifications for iOS and Android available as an option in addition to the paging system but I have some doubts/quest... 
[11:08:21] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[11:08:45] (CR) Vgutierrez: "PCC shows a NOOP for cp1071, I'm not 100% happy with this but we can start the discussion from here: https://puppet-compiler.wmflabs.org/c" [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[11:09:25] RECOVERY - Host kafka1023 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[11:09:44] kafka is me --^
[11:09:48] downtime has probably expired
[11:09:54] (last host to reboot)
[11:10:11] Operations, serviceops, User-Elukey: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T221346 (Joe) So the public CA cert will expire as well at the end of May. The solution I found is the following: # Disable puppet on all hosts that include class 'mcrouter'...
[11:11:59] Operations: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (faidon) Thanks @colewhite for raising (and re-raising!) this issue. This is a tricky but important problem to solve for sure! From the various conversations here and on-task it does not appear that we have consensus...
[11:13:59] (CR) Ema: trafficserver: Provide support for multiple ATS instances (3 comments) [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[11:16:25] Operations, serviceops, User-Elukey: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T221346 (Joe) a:fsero
[11:17:48] (PS10) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[11:19:09] Operations, serviceops, User-Elukey: Renew certs for mcrouter on all application servers.
- https://phabricator.wikimedia.org/T221346 (Joe) p:Triage→High
[11:20:45] (PS11) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[11:21:00] (CR) Giuseppe Lavagetto: profile::mediawiki::php: tweak ini settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) (owner: Giuseppe Lavagetto)
[11:22:14] (PS10) Jbond: Add prometheus interface to spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/496496
[11:23:20] (PS12) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[11:23:41] (PS3) Giuseppe Lavagetto: profile::mediawiki::php: tweak ini settings [puppet] - https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488)
[11:23:49] (CR) Vgutierrez: trafficserver: Provide support for multiple ATS instances (3 comments) [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[11:24:25] (CR) jerkins-bot: [V: -1] profile::mediawiki::php: tweak ini settings [puppet] - https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) (owner: Giuseppe Lavagetto)
[11:32:33] (CR) Jbond: [C: +2] "Sorry seems i have some old comments which never got sent here" (22 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/496496 (owner: Jbond)
[11:32:38] (PS11) Jbond: Add prometheus interface to spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/496496
[11:32:40] Operations: Discussion: Explore push notifications options - https://phabricator.wikimedia.org/T221265 (kostajh)
[11:33:41] PROBLEM - Host kafka1014 is DOWN: PING CRITICAL - Packet loss = 100%
[11:34:42] this is me --^
[11:34:59] RECOVERY - Host
kafka1014 is UP: PING WARNING - Packet loss = 80%, RTA = 0.29 ms
[11:37:17] RECOVERY - puppet last run on kubestagetcd1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:37:22] (PS1) Elukey: Replace the 'hdfs' user with 'analytics' in Hadoop's job launchers [puppet] - https://gerrit.wikimedia.org/r/504861 (https://phabricator.wikimedia.org/T220971)
[11:38:57] (PS4) Alexandros Kosiaris: Add failure-domain node labels for kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/504850
[11:38:59] (PS7) Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822)
[11:39:01] (PS4) Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs [puppet] - https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822)
[11:45:24] (CR) jenkins-bot: Add prometheus interface to spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/496496 (owner: Jbond)
[11:46:07] (PS30) Ema: cache: add profile::cache::varnish::frontend [puppet] - https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967)
[11:46:09] (PS21) Ema: cache: implement profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967)
[11:48:37] (PS8) Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822)
[11:48:39] (PS4) Alexandros Kosiaris: Support node labels and taints in kubelet [puppet] - https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821)
[11:48:41] (PS5) Alexandros Kosiaris: Add failure-domain node labels for kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/504850
[11:48:43] (PS5) Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs [puppet] - https://gerrit.wikimedia.org/r/504851
(https://phabricator.wikimedia.org/T220822)
[11:49:57] [XLhkPgpAAEUAAGfWeQAAAACF] 2019-04-18 11:49:20: Fatal exception of type "ConfigException" on Watchlist on enwiki
[11:50:12] happens once in a few or dozen minutes
[11:51:05] Operations, Analytics, User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (elukey) Open→Stalled Upstream told me that they are already working on a more generic version of my pull request (see gh issue), and t...
[11:51:09] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey)
[11:54:09] (PS31) Ema: cache: add profile::cache::varnish::frontend [puppet] - https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967)
[11:54:11] (PS22) Ema: cache: implement profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967)
[11:55:00] filed https://phabricator.wikimedia.org/T221358
[11:55:11] (CR) Alexandros Kosiaris: [C: +2] Add kubernetes[12]00[56] [puppet] - https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822) (owner: Alexandros Kosiaris)
[12:08:47] (PS2) Muehlenhoff: redis::instance: Remove support for Ubuntu/Upstart [puppet] - https://gerrit.wikimedia.org/r/500415
[12:13:37] (PS2) Chico Venancio: InitialiseSettings.php: add years to wgNamespacesWithSubpages for WikimaniaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297)
[12:15:51] (PS32) Ema: cache: add profile::cache::varnish::frontend [puppet] - https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967)
[12:15:53] (PS23) Ema: cache: implement profile::cache::varnish::backend
[puppet] - https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967)
[12:16:46] (CR) Urbanecm: [C: -1] "Adding to default is better IMO, see alternative https://gerrit.wikimedia.org/r/504018." [mediawiki-config] - https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297) (owner: Chico Venancio)
[12:19:48] !log installing Java security updates on WDQS autodeploy/test hosts
[12:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:57] (PS13) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[12:20:27] (CR) jerkins-bot: [V: -1] trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[12:20:33] FFS :)
[12:21:09] !log restarting blazegraph + updater on wdqs1009 / wdqs1010 for jvm upgrade
[12:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:17] RECOVERY - PHP7 rendering on mw1274 is OK: HTTP OK: HTTP/1.1 200 OK - 74150 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:22:24] (PS14) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[12:22:47] !log installing Java security updates on restbase-dev hosts (along with Cassandra restarts)
[12:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:03] (CR) Chico Venancio: Enable subpages by default for a few more extra namespaces (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/504018 (https://phabricator.wikimedia.org/T220950) (owner: Urbanecm)
[12:30:25] !log restarting blazegraph + updater on wdqs* for jvm upgrade
[12:30:27] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:52] (Abandoned) Chico Venancio: InitialiseSettings.php: add years to wgNamespacesWithSubpages for WikimaniaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297) (owner: Chico Venancio)
[12:33:39] PROBLEM - puppet last run on cloudvirt1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:36:23] !log Ran `php7adm /opcache-free` on mw1274 to test a theory related to T221347. The log entries related to that task stopped immediately.
[12:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:28] T221347: Fatal exception of type "ConfigException" - https://phabricator.wikimedia.org/T221347
[12:37:35] (CR) Urbanecm: Enable subpages by default for a few more extra namespaces (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/504018 (https://phabricator.wikimedia.org/T220950) (owner: Urbanecm)
[12:39:15] (CR) Volans: coherence report: General improvements and rack checks (1 comment) [software/netbox-reports] - https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: CRusnov)
[12:41:00] (PS33) Ema: cache: add profile::cache::varnish::frontend [puppet] - https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967)
[12:41:02] (PS24) Ema: cache: implement profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967)
[12:44:02] (PS34) Ema: cache: add profile::cache::varnish::frontend [puppet] - https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967)
[12:44:04] (PS25) Ema: cache: implement profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967)
[12:46:17] PROBLEM - puppet last run on etcd1004 is
CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:47:04] (CR) Chico Venancio: "I'm not familiar enough with mediawiki deployments to know with overriding the default (setting these namespaces to have subpages) will cr" [mediawiki-config] - https://gerrit.wikimedia.org/r/504018 (https://phabricator.wikimedia.org/T220950) (owner: Urbanecm)
[12:51:03] @_joe_ filing task now
[12:52:49] (CR) Ema: [C: +2] cache: add profile::cache::varnish::frontend [puppet] - https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[12:53:01] (CR) Ema: [C: +2] cache: implement profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[12:53:38] (PS1) Reedy: Revert "group1 wikis to 1.34.0-wmf.1 refs T220726" [mediawiki-config] - https://gerrit.wikimedia.org/r/504870
[12:54:02] (CR) Reedy: [C: +2] Revert "group1 wikis to 1.34.0-wmf.1 refs T220726" [mediawiki-config] - https://gerrit.wikimedia.org/r/504870 (owner: Reedy)
[12:54:06] _joe_: ^
[12:54:19] Operations: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre)
[12:54:22] <_joe_> Reedy: thanks
[12:54:28] <_joe_> ok there it is :)
[12:54:29] LMK if you want it reverting back further when that's deployed
[12:54:33] Operations: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre) p:Triage→Unbreak!
[12:54:40] <_joe_> Reedy: that should be enough
[12:54:43] (PS5) Alexandros Kosiaris: Support node labels and taints in kubelet [puppet] - https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821)
[12:54:45] (PS6) Alexandros Kosiaris: Add failure-domain node labels for kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/504850
[12:54:47] (PS6) Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs [puppet] - https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822)
[12:54:48] Operations: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre) (setting prio per IRC chat)
[12:55:08] (Merged) jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.1 refs T220726" [mediawiki-config] - https://gerrit.wikimedia.org/r/504870 (owner: Reedy)
[12:55:35] Operations, MassMessage: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre)
[12:56:07] Operations, MassMessage: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Volans)
[12:56:18] (CR) Fsero: "i miss staging nodes, besides that really nice!"
[puppet] - https://gerrit.wikimedia.org/r/504850 (owner: Alexandros Kosiaris)
[12:56:59] (CR) Fsero: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821) (owner: Alexandros Kosiaris)
[12:58:41] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: group1 back to .25
[12:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:02] (PS4) Vgutierrez: dns: Move DNS operations to its own module [software/acme-chief] - https://gerrit.wikimedia.org/r/504511
[13:00:04] (PS4) Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] - https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518)
[13:00:07] RECOVERY - puppet last run on cloudvirt1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:09] (CR) jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.1 refs T220726" [mediawiki-config] - https://gerrit.wikimedia.org/r/504870 (owner: Reedy)
[13:02:37] Operations, MassMessage: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre) Testing now...
[13:03:01] (PS2) BBlack: Revert "wiktionary: test with zone-local CNAME->DYNA" [dns] - https://gerrit.wikimedia.org/r/504587 (https://phabricator.wikimedia.org/T208263)
[13:03:03] (PS2) BBlack: wikipedia.org: test with zone-local CNAME->DYNA [dns] - https://gerrit.wikimedia.org/r/504588 (https://phabricator.wikimedia.org/T208263)
[13:03:26] <_joe_> uhm
[13:03:32] <_joe_> doesn't seem to have solved the problem
[13:04:01] (CR) Vgutierrez: "Thanks for the review!"
(3 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/504511 (owner: Vgutierrez)
[13:04:30] _joe_: I don't know if the jobs in the queue are going to necessarily magically start working
[13:04:41] If the data stored in them is wrong, rather than the code executing them
[13:04:47] <_joe_> you're right
[13:04:50] <_joe_> the data is wrong
[13:04:50] Might need to wait for them to all disappear and new ones to start inserting
[13:04:56] <_joe_> yeah, it will take time
[13:05:05] <_joe_> and I can confirm it's solved
[13:05:06] Operations, MassMessage: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre) This one went through: https://sv.wikipedia.org/wiki/Användardiskussion:Johan_(WMF)#Test_for_https://phabricator.wikimedia.org/T221365
[13:05:32] (PS1) Alexandros Kosiaris: Fixup for kubernetes-node-virtual.cfg [puppet] - https://gerrit.wikimedia.org/r/504872 (https://phabricator.wikimedia.org/T220822)
[13:05:41] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[13:05:42] (CR) Alexandros Kosiaris: "Staging nodes added" [puppet] - https://gerrit.wikimedia.org/r/504850 (owner: Alexandros Kosiaris)
[13:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:36] (CR) Alexandros Kosiaris: [C: +2] Fixup for kubernetes-node-virtual.cfg [puppet] - https://gerrit.wikimedia.org/r/504872 (https://phabricator.wikimedia.org/T220822) (owner: Alexandros Kosiaris)
[13:08:58] !log roll restart of cassandra on aqs* to pick up new openjdk upgrades
[13:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:46] Operations: Discussion: Explore push notifications options - https://phabricator.wikimedia.org/T221265 (jbond) > I have some doubts/questions about the specific
choice of Prowl. For the record im not stuck with this as a solution its just something i have used before > In particular: > * It's an external se...
[13:10:06] Operations, MassMessage: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Joe) @Elitre the problem seems to be a regression in MediaWiki, tracked at T221368. @Reedy reverted the group 1 wikis (including meta) to the previous version and now the messages can be re-sent. I don't thin...
[13:10:14] (PS1) Alexandros Kosiaris: backup1001: Use 10g NIC in DHCP requests [puppet] - https://gerrit.wikimedia.org/r/504873 (https://phabricator.wikimedia.org/T196478)
[13:10:34] (CR) Alexandros Kosiaris: [V: +2 C: +2] backup1001: Use 10g NIC in DHCP requests [puppet] - https://gerrit.wikimedia.org/r/504873 (https://phabricator.wikimedia.org/T196478) (owner: Alexandros Kosiaris)
[13:12:49] RECOVERY - puppet last run on etcd1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:15:01] (PS1) Ema: cache: remove profile::cache::{text,upload} [puppet] - https://gerrit.wikimedia.org/r/504874 (https://phabricator.wikimedia.org/T219967)
[13:15:03] (PS1) Ema: cache: remove cacheproxy::instance_pair [puppet] - https://gerrit.wikimedia.org/r/504875 (https://phabricator.wikimedia.org/T219967)
[13:15:08] (CR) Volans: "replies to last pending comments/questions inline" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/496496 (owner: Jbond)
[13:16:24] Operations, MassMessage: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre) >>! In T221365#5122559, @Joe wrote: > @Elitre the problem seems to be a regression in MediaWiki, tracked at T221368. @Reedy reverted the group 1 wikis (including meta) to the previous version and now...
[13:20:03] (CR) Fsero: [C: +1] Add failure-domain node labels for kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/504850 (owner: Alexandros Kosiaris)
[13:20:11] (CR) Fsero: [C: +2] Add failure-domain node labels for kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/504850 (owner: Alexandros Kosiaris)
[13:21:08] (CR) Fsero: Give roles to the new kubernetes[12]00[56] VMs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822) (owner: Alexandros Kosiaris)
[13:21:28] (PS5) Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] - https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518)
[13:22:05] (CR) Ema: [C: +2] cache: remove profile::cache::{text,upload} [puppet] - https://gerrit.wikimedia.org/r/504874 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[13:22:11] (CR) Vgutierrez: "You're right regarding the -1, I'm marking this as WIP. Thanks for the review" (4 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: Vgutierrez)
[13:22:15] (CR) Ema: [C: +2] cache: remove cacheproxy::instance_pair [puppet] - https://gerrit.wikimedia.org/r/504875 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[13:22:19] Operations, MassMessage: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (Elitre) Actually, tried again and it seems to have worked ([[ https://meta.wikimedia.org/wiki/User_talk:Smalyshev_(WMF) | example ]]).
[13:24:14] (CR) Ottomata: "THANK YOU" [puppet] - https://gerrit.wikimedia.org/r/504845 (https://phabricator.wikimedia.org/T219556) (owner: Alexandros Kosiaris)
[13:24:57] PROBLEM - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[13:25:04] (CR) Ottomata: profile::analytics::cluster::packages::statistics: add python3-venv (1 comment) [puppet] - https://gerrit.wikimedia.org/r/504829 (owner: Elukey)
[13:26:06] (CR) Ottomata: "The user needs to be created by puppet somewhere, ya?" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[13:26:17] RECOVERY - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.190 port 9042 https://phabricator.wikimedia.org/T93886
[13:26:20] (CR) Ottomata: "Oh, other patch ok!" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[13:26:22] (CR) Elukey: [C: +2] ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/504829 (owner: Elukey)
[13:27:06] the cassandra spam on AQS is due to my roll restart
[13:27:08] (CR) Ottomata: "Oh, nope not there. Ya, somewhere there will need to be a user { 'analytics': declared. See the analytics-search user." [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[13:27:29] (PS6) Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] - https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518)
[13:27:35] (CR) Ottomata: "Ohhhk!" [puppet] - https://gerrit.wikimedia.org/r/504829 (owner: Elukey)
[13:27:47] (CR) Elukey: "Ack will do!"
[puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[13:28:00] (CR) Vgutierrez: "This change is ready for review." (5 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: Vgutierrez)
[13:30:36] !log rolling updates of ruby2.1 on jessie
[13:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:57] (CR) Alex Monk: [C: +2] dns: Move DNS operations to its own module [software/acme-chief] - https://gerrit.wikimedia.org/r/504511 (owner: Vgutierrez)
[13:33:03] godog, hi
[13:33:29] (Merged) jenkins-bot: dns: Move DNS operations to its own module [software/acme-chief] - https://gerrit.wikimedia.org/r/504511 (owner: Vgutierrez)
[13:34:44] (PS15) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[13:35:02] (CR) jenkins-bot: dns: Move DNS operations to its own module [software/acme-chief] - https://gerrit.wikimedia.org/r/504511 (owner: Vgutierrez)
[13:37:45] godog, thanks for reviewing my stuff. I was wondering what the next step for the swift ring deployment-prep commits was
[13:38:04] one of those is pretty much just committing stuff that is already live so
[13:42:04] Operations, Icinga, monitoring, Patch-For-Review: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (CDanis) similar again `Apr 17 21:20:17 icinga1001 icinga[26588]: Reloading icinga monitoring daemon configuration files: icinga. Apr 17 21:20:17...
[13:45:25] (PS16) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[13:49:01] (PS1) Alexandros Kosiaris: More fixes to kubernetes-node-virtual.cfg [puppet] - https://gerrit.wikimedia.org/r/504877 (https://phabricator.wikimedia.org/T220822)
[13:49:26] (CR) Alexandros Kosiaris: [V: +2 C: +2] More fixes to kubernetes-node-virtual.cfg [puppet] - https://gerrit.wikimedia.org/r/504877 (https://phabricator.wikimedia.org/T220822) (owner: Alexandros Kosiaris)
[13:52:41] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:52:53] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:53:15] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:54:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:55:15] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:55:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL -
/usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:55:35] well that's better than "CRITICAL: Traceback (most recent call last):" but we still can't do much about it 🙃
[13:55:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:55:39] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:55:59] (CR) Elukey: Puppetize schema.wikimedia.org and refactor eventschemas module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:56:28] (PS1) Alexandros Kosiaris: Update autoinstall params for backup1001 [puppet] - https://gerrit.wikimedia.org/r/504879 (https://phabricator.wikimedia.org/T196478)
[13:57:03] (CR) Alexandros Kosiaris: [V: +2 C: +2] Update autoinstall params for backup1001 [puppet] - https://gerrit.wikimedia.org/r/504879 (https://phabricator.wikimedia.org/T196478) (owner: Alexandros Kosiaris)
[14:00:51] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:00:51] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:00:58] Operations, PHP 7.2 support, Wikimedia-production-error: Fatal
exception of type "ConfigException" - https://phabricator.wikimedia.org/T221347 (Anomie) Looking into this, I found that all instances of the "Kogo" exception were on mw1274 and all were using php 7.2. Having seen https://wikitech.wikimed...
[14:01:03] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:01:03] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 14 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:01:09] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 7 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:03:23] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 4 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:03:35] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:03:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 14 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:07:01] (PS1) Muehlenhoff: Puppetise initial KDC config [puppet] - https://gerrit.wikimedia.org/r/504880
[14:07:48] (CR) jerkins-bot: [V: -1] Puppetise initial KDC config [puppet] - https://gerrit.wikimedia.org/r/504880 (owner: Muehlenhoff)
[14:17:30] Operations, Beta-Cluster-Infrastructure, Mathoid, Core Platform
Team Backlog (Watching / External), and 2 others: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (akosiaris) >>! In T200832#5120806, @Krenair wrote: >>>! In T200832#5061718, @akosiaris wrote: >>>>! In T200832#505...
[14:20:03] (PS2) Muehlenhoff: Puppetise initial KDC config [puppet] - https://gerrit.wikimedia.org/r/504880
[14:22:31] (PS5) Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[14:23:17] (CR) jerkins-bot: [V: -1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[14:26:06] (PS6) Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[14:26:57] (CR) jerkins-bot: [V: -1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[14:28:20] (PS7) Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[14:29:07] (CR) jerkins-bot: [V: -1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[14:29:18] (CR) Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[14:30:32] (PS8) Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787
(https://phabricator.wikimedia.org/T219552) [14:31:26] (03CR) 10jerkins-bot: [V: 04-1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:31:32] 10Operations, 10ops-eqiad: Broken network connection on ms-be1044 after reboot - https://phabricator.wikimedia.org/T221376 (10MoritzMuehlenhoff) [14:33:10] (03CR) 10Elukey: Puppetize schema.wikimedia.org and refactor eventschemas module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:34:01] 10Operations, 10MassMessage, 10Patch-For-Review: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (10Reedy) [14:34:09] (03PS9) 10Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) [14:34:56] (03CR) 10jerkins-bot: [V: 04-1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:37:17] (03PS10) 10Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) [14:39:34] (03CR) 10Elukey: "One question - is the service going to use TLS or just plain HTTP?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:40:03] !log push firewall change to pfw3-eqiad - T221278 [14:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:21] (03CR) 10Elukey: [C: 03+1] Puppetise initial KDC config [puppet] - 10https://gerrit.wikimedia.org/r/504880 (owner: 10Muehlenhoff) [14:44:24] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10ayounsi) Got an email 1h ago saying the onsite crew was still splicing hard. > This is to inform you that splicing activity on the east side is still ongoing and we will keep you updated with work... [14:53:33] ACKNOWLEDGEMENT - Host ms-be1044 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T221376 [14:54:12] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::php: tweak ini settings [puppet] - 10https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) [14:54:56] (03PS6) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [14:55:10] (03PS3) 10Muehlenhoff: Puppetise initial KDC config [puppet] - 10https://gerrit.wikimedia.org/r/504880 [14:55:34] !log synchronizing docker_registry_codfw swift container from docker_registry [14:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:38] (03PS6) 10Alexandros Kosiaris: Support node labels and taints in kubelet [puppet] - 10https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821) [14:55:39] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Support node labels and taints in kubelet [puppet] - 10https://gerrit.wikimedia.org/r/504849 (https://phabricator.wikimedia.org/T220821) (owner: 10Alexandros Kosiaris) [14:55:41] (03CR) 10Andrew Bogott: [C: 03+2] cloud VMS: move primary resolvers to 
cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) (owner: 10Andrew Bogott) [14:55:48] (03PS7) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [14:56:00] (03PS7) 10Alexandros Kosiaris: Add failure-domain node labels for kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/504850 [14:56:06] (03CR) 10Alexandros Kosiaris: [V: 03+2] Add failure-domain node labels for kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/504850 (owner: 10Alexandros Kosiaris) [14:56:09] (03CR) 10Alexandros Kosiaris: [V: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/504850 (owner: 10Alexandros Kosiaris) [14:57:18] (03PS8) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [14:59:22] (03PS4) 10Muehlenhoff: Puppetise initial KDC config [puppet] - 10https://gerrit.wikimedia.org/r/504880 [14:59:38] (03CR) 10jerkins-bot: [V: 04-1] Puppetise initial KDC config [puppet] - 10https://gerrit.wikimedia.org/r/504880 (owner: 10Muehlenhoff) [15:00:04] James_F: #bothumor I � Unicode. All rise for Structured Data on Commons Phase II deployment deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1500). [15:00:21] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/504880 (owner: 10Muehlenhoff) [15:01:07] I'm holding because of the train / UBN situation, CC Reedy. [15:01:27] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) Host is up and running but as @Cmjohnson points out in T196478#4976375 ` akosiaris@backup1001:~$ sudo megacli -PDList -a0 Adapter #0... 
[15:03:03] (03CR) 10Muehlenhoff: [C: 03+2] Puppetise initial KDC config [puppet] - 10https://gerrit.wikimedia.org/r/504880 (owner: 10Muehlenhoff) [15:07:39] (03PS3) 10Volans: flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 [15:07:41] (03PS1) 10Volans: prometheus: fix PytestDeprecationWarning [software/spicerack] - 10https://gerrit.wikimedia.org/r/504892 [15:08:17] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [15:12:27] (03PS1) 10Ema: cache: remove unused hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/504895 (https://phabricator.wikimedia.org/T219967) [15:16:35] (03PS1) 10Alexandros Kosiaris: Kubernetes node labels followup [puppet] - 10https://gerrit.wikimedia.org/r/504897 (https://phabricator.wikimedia.org/T220821) [15:17:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] Kubernetes node labels followup [puppet] - 10https://gerrit.wikimedia.org/r/504897 (https://phabricator.wikimedia.org/T220821) (owner: 10Alexandros Kosiaris) [15:19:26] (03CR) 10Alex Monk: acme-chief: Add script for Designate integration (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [15:19:29] Update: We've temporarily pushed back the SDC launch to wait for Commons to get back to wmf.1. Will review in an hour's time. 
[15:19:48] (03PS1) 10CDanis: icinga: pause nsca on reloads [puppet] - 10https://gerrit.wikimedia.org/r/504898 [15:20:27] (03PS2) 10CDanis: icinga: pause nsca on reloads [puppet] - 10https://gerrit.wikimedia.org/r/504898 (https://phabricator.wikimedia.org/T196336) [15:21:17] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:21:47] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10elukey) My idea going forward is to: 1) get https://gerrit.wikimedia.org/r/504831 reviewed/m... [15:21:56] (03PS10) 10Alex Monk: acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [15:22:09] (03CR) 10jerkins-bot: [V: 04-1] acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [15:22:11] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 74123 bytes in 1.889 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:22:32] (03PS2) 10Ema: cache: remove unused purge-related hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/504895 (https://phabricator.wikimedia.org/T219967) [15:24:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/504892 (owner: 10Volans) [15:25:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [15:28:57] (03PS11) 10Alex Monk: acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [15:29:45] (03CR) 10jerkins-bot: [V: 04-1] acme-chief: Add script for Designate integration 
[puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [15:32:03] (03PS12) 10Alex Monk: acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [15:32:57] (03PS1) 10Thcipriani: gerrit: increase projects cache [puppet] - 10https://gerrit.wikimedia.org/r/504912 (https://phabricator.wikimedia.org/T221026) [15:33:30] (03PS3) 10DCausse: Add a new extension point SshExecuteCommandInterceptor [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/502764 [15:33:47] (03CR) 10Paladox: [C: 03+1] gerrit: increase projects cache [puppet] - 10https://gerrit.wikimedia.org/r/504912 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [15:34:31] (03CR) 10Vgutierrez: [C: 03+1] "nice, I'll get this merged on Tuesday EU morning" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [15:34:35] (03CR) 10Ema: "noop: https://puppet-compiler.wmflabs.org/compiler1002/15893/" [puppet] - 10https://gerrit.wikimedia.org/r/504895 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [15:35:16] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/15894/tools-sgeexec-0905.tools.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/504817 (https://phabricator.wikimedia.org/T221225) (owner: 10BryanDavis) [15:36:50] (03CR) 10Bstorm: [C: 03+1] "It looks right to me. Another set of eyes wouldn't hurt, but I'd also like to get this in place before Friday in case it blows up." 
[puppet] - 10https://gerrit.wikimedia.org/r/504817 (https://phabricator.wikimedia.org/T221225) (owner: 10BryanDavis) [15:44:45] PROBLEM - Memory correctable errors -EDAC- on db1068 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [15:50:18] (03PS2) 10Alex Monk: archiva::proxy: remove old letsencrypt module stuff [puppet] - 10https://gerrit.wikimedia.org/r/504648 (https://phabricator.wikimedia.org/T221268) [15:52:35] 10Operations, 10ops-codfw, 10Traffic, 10decommission: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10RobH) a:03RobH [15:55:12] 10Operations, 10ops-codfw, 10Traffic, 10decommission: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for acamar.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from Pupp... [15:55:39] 10Operations, 10ops-codfw, 10Traffic, 10decommission: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for achernar.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from Pu... [15:56:41] Krenair: heya, shdubsh and I are having a look into T221288 and curious what “#2019041710004636” from the description was referring to? [15:56:42] T221288: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 [15:57:14] herron, you might not have access, it's https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketNumber=2019041710004636 [15:57:41] ah, indeed I don’t think I do [15:57:46] I can probably forward the email to your wikimedia.org account [15:57:54] that would be great, I’m kherron@ [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:37] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:01:45] herron, done [16:02:03] you can copy it to shdubsh too [16:02:41] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:03:13] 10Operations, 10ops-codfw, 10Traffic, 10decommission: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10RobH) >>! In T198286#5014740, @MoritzMuehlenhoff wrote: > @RobH I'd like to use one of the hosts for some installer tests in the next weeks, can we hold decommissioning thes... [16:03:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Andrew) [16:03:24] Krenair: got it, thanks! 
[16:03:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Andrew) [16:03:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Andrew) [16:04:58] (03CR) 10Cwhite: [C: 03+1] prometheus: fix PytestDeprecationWarning [software/spicerack] - 10https://gerrit.wikimedia.org/r/504892 (owner: 10Volans) [16:05:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Andrew) This host has a broken disk 300Gb SAS drive -- that needs to be replaced before we can re-image. [16:05:32] (03PS1) 10RobH: decommission of acamar and achernar [puppet] - 10https://gerrit.wikimedia.org/r/504918 (https://phabricator.wikimedia.org/T198286) [16:07:07] (03PS1) 10RobH: decommission production dns for acamar and achernar [dns] - 10https://gerrit.wikimedia.org/r/504919 (https://phabricator.wikimedia.org/T198286) [16:07:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Andrew) The raid config tool on this host is not cooperating. With luck a bios update will get us past this. 
[16:09:29] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10RobH) [16:09:39] (03CR) 10BBlack: [C: 03+2] Revert "wiktionary: test with zone-local CNAME->DYNA" [dns] - 10https://gerrit.wikimedia.org/r/504587 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [16:12:08] 10Operations, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10ayounsi) 05Open→03Resolved p:05Triage→03Low [16:13:30] 10Operations, 10hardware-requests, 10serviceops: setup/install WMF7426 as phab1002.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10RobH) p:05Triage→03Normal [16:13:33] !log rollback dhcp option 82 test from asw2-b-eqiad [16:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:35] !log remove peering to 63199 in eqsin (down for 1 month, no reply to emails) [16:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:32] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 263, down: 0, shutdown: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM. 
Don't forget to create the mcrouter certs before merging" [puppet] - 10https://gerrit.wikimedia.org/r/504791 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [16:19:42] (03CR) 10BBlack: [C: 03+2] wikipedia.org: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/504588 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [16:20:42] !log Experimental DNS-level changes deploying for wikipedia.org domain - if wikipedia.org DNS problems appear, revert https://gerrit.wikimedia.org/r/c/operations/dns/+/504588 - T208263 [16:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:47] T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 [16:20:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] conftool: add mw2151 as a jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/504793 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [16:23:06] (03CR) 10Giuseppe Lavagetto: "see a small comment. Do not forget to add the mcrouter certs." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [16:24:34] PROBLEM - EDAC syslog messages on db1068 is CRITICAL: 4.026 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [16:25:17] (03CR) 10Hashar: [C: 03+1] "Indeed :)" [puppet] - 10https://gerrit.wikimedia.org/r/504912 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [16:25:19] 10Operations, 10hardware-requests, 10serviceops: setup/install WMF7426 as phab1003.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10RobH) [16:26:32] 10Operations, 10hardware-requests, 10serviceops: setup/install WMF7426 as phab1003.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10RobH) [16:27:19] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10RobH) [16:27:39] 10Operations, 10hardware-requests, 10serviceops: setup/install WMF7426 as phab1003.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10Krenair) Did you mean phab1003.eqiad.wmnet? Existing phab* hosts are internal [16:28:30] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002 - https://phabricator.wikimedia.org/T195623 (10RobH) [16:28:32] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10RobH) [16:28:35] 10Operations, 10hardware-requests, 10serviceops: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10RobH) 05Open→03Resolved Granted and resolving, setup is on T221389. 
[16:28:40] 10Operations, 10ops-eqiad, 10serviceops: setup/install WMF7426 as phab1003.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10RobH) [16:29:10] Anybody able to have a look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/504018? [16:36:26] 10Operations, 10ops-eqiad, 10serviceops: apply hostname label for WMF7426 as phab1003 - https://phabricator.wikimedia.org/T221392 (10RobH) p:05Triage→03Normal [16:36:47] 10Operations, 10ops-eqiad, 10serviceops: setup/install WMF7426 as phab1003.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10RobH) [16:42:36] 10Operations, 10ops-eqiad, 10serviceops: setup/install WMF7426 as phab1003.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10RobH) [16:42:48] 10Operations, 10serviceops: setup/install WMF7426 as phab1003.wikimedia.org - https://phabricator.wikimedia.org/T221389 (10RobH) [16:43:10] 10Operations, 10serviceops: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10RobH) [16:45:12] (03PS2) 10RobH: decommission production dns for acamar and achernar [dns] - 10https://gerrit.wikimedia.org/r/504919 (https://phabricator.wikimedia.org/T198286) [16:46:20] (03CR) 10RobH: [C: 03+2] decommission production dns for acamar and achernar [dns] - 10https://gerrit.wikimedia.org/r/504919 (https://phabricator.wikimedia.org/T198286) (owner: 10RobH) [16:50:00] (03PS8) 10Paladox: WIP: Update gerrit to 2.16.7 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012 [16:50:07] 10Operations, 10ops-eqiad: Broken network connection on ms-be1044 after reboot - https://phabricator.wikimedia.org/T221376 (10Cmjohnson) 05Open→03Resolved The cable looks good and appears to be passing traffic, I have a link at the server and shows up on the switch xe-2/0/8 up up ms-be1044 St... 
[16:50:36] (03PS1) 10RobH: setting phab1003 mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/504922 (https://phabricator.wikimedia.org/T221389) [16:51:09] (03PS2) 10RobH: setting phab1003 mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/504922 (https://phabricator.wikimedia.org/T221389) [16:51:18] (03CR) 10RobH: [C: 03+2] setting phab1003 mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/504922 (https://phabricator.wikimedia.org/T221389) (owner: 10RobH) [16:53:36] 10Operations, 10serviceops: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10RobH) [16:53:47] (03PS1) 10CDanis: add a .gitreview [software/conftool] - 10https://gerrit.wikimedia.org/r/504923 [16:53:51] 10Operations, 10serviceops: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Cmjohnson) [16:53:53] 10Operations, 10ops-eqiad, 10serviceops: apply hostname label for WMF7426 as phab1003 - https://phabricator.wikimedia.org/T221392 (10Cmjohnson) 05Open→03Resolved done [16:54:18] 10Operations, 10serviceops: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10RobH) a:05RobH→03Dzahn Please note this is now ready for installation, and I'm assigning to @dzahn per our IRC conversation. Please ensure the system state in netbox is changed to 'stag... [16:57:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Cmjohnson) This server is refusing to allow me to access the raid configuration. It has the old config now...I think @robh may know how to update BIO... 
[16:58:13] cmjohnson1: im looking at cloudvirt1005 now [16:58:30] (03PS1) 10Effie Mouzeli: prometheus: Fix haproxy mtail stats for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/504924 (https://phabricator.wikimedia.org/T220499) [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1700). [17:11:16] (03PS2) 10Dzahn: conftool: add mw2151 as a jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/504793 (https://phabricator.wikimedia.org/T192457) [17:11:53] (03CR) 10Dzahn: [C: 03+2] conftool: add mw2151 as a jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/504793 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [17:13:21] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) >>! In T192457#5093559, @MoritzMuehlenhoff wrote: > During the HHVM updates I noticed that mw2151 is in site.pp as a jobrunner, but not listed in conftool-data. Thanks! Added to conftool-dat... 
[17:13:59] (03CR) 10CDanis: [C: 03+1] prometheus: Fix haproxy mtail stats for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/504924 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [17:14:41] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/15895/prometheus1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504924 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [17:15:52] (03CR) 10Dzahn: site/mw/conftool: assign mw2150 as jobrunner, mw2244 as API server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [17:15:55] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] prometheus: Fix haproxy mtail stats for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/504924 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [17:16:05] (03PS2) 10Effie Mouzeli: prometheus: Fix haproxy mtail stats for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/504924 (https://phabricator.wikimedia.org/T220499) [17:16:22] (03CR) 10Andrew Bogott: "This looks right to me, but testing I get:" [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [17:17:17] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) >>! In T192457#5024820, @RobH wrote: > Please note we may steal mw2245 for thumbor1005 use on T218323 @RobH @jijiki I included that in my gerrit change https://gerrit.wikimedia.org/r/c/opera... 
[17:18:35] jouncebot: now [17:18:35] For the next 0 hour(s) and 41 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1700) [17:18:48] jouncebot: next [17:18:48] In 0 hour(s) and 41 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1800) [17:19:51] syncing https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/504863 [17:20:01] !log twentyafterfour@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/AbuseFilter/includes/: sync https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/504863 (duration: 01m 00s) [17:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:33] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [17:26:49] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10RobH) >>! In T192457#5123320, @Dzahn wrote: >>>! In T192457#5024820, @RobH wrote: >> Please note we may steal mw2245 for thumbor1005 use on T218323 > > @RobH @jijiki I included that in my gerrit ch... [17:27:38] 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Consider ways to make puppetmaster CA changes smoother on the puppet client end - https://phabricator.wikimedia.org/T220268 (10Andrew) I'm wary of having a central repo of alternate puppetmasters (mostly because maintaining it seems like a pain) but... [17:29:12] 10Puppet, 10Patch-For-Review, 10cloud-services-team (Kanban): Have puppet-merge on puppetmaster1001 publish the official sha1 after merging - https://phabricator.wikimedia.org/T219390 (10Andrew) > I think for this to make sense we should require labs/private repository to also exist on a trusted instance som... 
[17:30:29] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:32:18] mobrovac: Do you feel comfortable back-porting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/504920 ? [17:32:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10RobH) I've updated the system bios to the newest revision and then handed back to Chris. When attempting to enter the raid bios, it fails to actually... [17:33:16] James_F: yup, i'll put it up for swat [17:33:37] mobrovac: Or just go now? It's an UBN train blocker… [17:34:05] CC twentyafterfour. [17:34:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Cmjohnson) @Andrew even after the updates by rob I am not able to get to the raid utility. Do you want to keep it as-is without having the 2 spare di... [17:36:29] (03PS2) 10Alex Monk: git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) [17:36:33] (03CR) 10Jforrester: "I think this is a significantly inferior change to the one you abandoned in favour of this one. That is more targeted and less likely to c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504018 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [17:38:11] James_F: oh ok, going [17:39:18] * mobrovac deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504929/ [17:40:37] RECOVERY - mediawiki-installation DSH group on mw2151 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist [17:44:29] mobrovac: James_F: I don't know if that change alone will fix it, but it's certainly progress. 
I suspect the change will make the error go away from Logstash, but the jobs may still be broken (silently operating in an invalid title instead, if core's never scheduling it anymore for most jobs which still use titles). E.g. purging 'Special:' instead of the pages being edited. [17:44:47] So do await the core patch review from Aaron before rolling forward with group1 :) [17:45:23] yes yes, that is my understanding too Krinkle, the problem are the changes made to core that were not reflected in EB [17:45:49] (03PS2) 10RobH: decommission of acamar and achernar [puppet] - 10https://gerrit.wikimedia.org/r/504918 (https://phabricator.wikimedia.org/T198286) [17:45:55] (03CR) 10RobH: [C: 03+2] decommission of acamar and achernar [puppet] - 10https://gerrit.wikimedia.org/r/504918 (https://phabricator.wikimedia.org/T198286) (owner: 10RobH) [17:46:31] !log mobrovac@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/EventBus/includes/JobExecutor.php: Default to a dummy title for invalid titles - T221368 (duration: 01m 01s) [17:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:36] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [17:47:07] mobrovac: For next week, I think making the field optional would probably be best long-term. E.g. in EventJobJobQueue for new jobs being serialised, always omit them (for all jobs, including those that need titles, always using params). Then core will handle the run-time compat. [17:47:56] And any older jobs still having it set, should always be valid, so we can handle those in the current way as well (aside from foreign/localised namespaces, but those never worked on the new queue) [17:48:39] Yup, I wasn't going to roll the train myself, don't worry. 
:-) [17:49:17] !log mw2151 - scap pull [17:49:18] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10RobH) [17:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:22] Krinkle: hm we have the title as a req property in most mw-emitted events (in the topic schemae) so making the title optional would take some effort [17:50:28] so probably not doable by next week [17:50:34] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10RobH) a:05RobH→03Papaul These are ready to have the disks securely erased and decomissioned. [17:51:07] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) 13:40 <+icinga-wm> RECOVERY - mediawiki-installation DSH group on mw2151 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist 13:49 < mutante> !log mw215... [17:51:31] mobrovac: I noticed some kind of inheritence indeed, not sure why that inheritance exists as jobs have nothing to do with normal MW events (that is, a job is more analogues to an API request than an event, whatever MW might do like create a user or insert a revision, hasn't happened yet). [17:51:45] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational [17:51:48] Is that inheritance relied upon? [17:51:52] Anyway, yeah, maybe not by next week. [17:52:49] mobrovac: we could a dummy value though. It's only about being able to store the params in Kafka, and read them back out. Whether we also store an unused field (like how page_namespace is already unused), isn't a problem. [17:53:16] e.g. just always set it to page_namespace=-1 and page_title=":" (colon being the shortest invalid title) [17:53:24] which inheritance are you referring to Krinkle? 
we just try to have the schemae in sync with each other, so that all msgs that have e.g. the db, have to have the same fields [17:53:46] hm true [17:53:52] mobrovac: Is there a consumer that depends on the inheritance. e.g. some kind of generic event consumer that isn't aware of jobs. [17:54:24] Because if it doesn't, then we can make jobs not inherit from that, and then add the fields we need and not the others. That change wouldn't require changing any other schemas. [17:54:38] Anyway, as said, we dont have to break the schema, I trust you on what's easiest there :) [17:54:50] hm i see what you mean now [17:55:19] (03CR) 10Andrew Bogott: "This works for the main puppet repo but it alerts behavior for /labs/private where we're still specifying a branch name rather than a sha1" [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [17:55:34] we only have one schema for all jobs though so we either change it for all or for none, but yes, as you point out, the only thing is to have something that doesn't break [17:56:04] (03CR) 10Alex Monk: "Yeah I think we should do the same with labs/private so we'd get a hash for that as well. Will need the publishing side to be set up thoug" [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [17:56:27] Krinkle: we basically need to fix the enqueue/dequeue part to properly construct titles, because we already fill in the namespace correctly, it's just that the title is the display title that doesn't take into account the namespace [17:58:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Andrew) I'll try installing it. If everything else works we'll just live without the spares. 
[18:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:59] mobrovac: changing page_title from a formatted prefixed title (e.g. "User:Example") to just the title ("Example") would be very difficult at this point as we'd have no way to distinguish them at run-time. So rather than making it work, it seems easier to stop using entirely, which core now supports. So at enqueue time we'd always hardcode -1/":", and don't use at dequeue time. with a todo to drop from the schema at your earliest [18:01:59] convenience. [18:02:30] !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2151.codfw.wmnet,cluster=jobrunner,service=nginx [18:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:39] Krinkle: in theory we can work around it, like try with the ns,title tuple, if it doesn't exist, remove everything before ':' [18:03:41] hmmm [18:03:43] maybe not [18:04:16] mobrovac: titles in mediawiki, that's a long rabbit whole you won't come out of alive [18:04:59] there are edge cases for anything we can come up with within a string. [18:05:04] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:05:13] the only way to make that work would be to introduce a new field, e.g. page_title_really [18:05:19] and use that when present, together with page_namespace. [18:05:26] as it should have been originally. [18:05:28] Krinkle: yeah, yeah, that's what i realised as soon as i typed it [18:05:36] lool page_title_for_realz [18:05:37] hahaha [18:05:57] but yeah, let's just stub it until we can remove it. 
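Note: the stub approach agreed on above can be sketched roughly as follows. This is a minimal illustration and not EventBus or MediaWiki code: the function names and the event layout are hypothetical. Only the two stub values (`page_namespace=-1`, `page_title=":"`) and the "remove everything before ':'" idea come from the discussion itself.

```python
# Hypothetical sketch: none of these names are real EventBus/MediaWiki APIs.

STUB_NAMESPACE = -1  # stub value proposed in the discussion
STUB_TITLE = ":"     # described in the discussion as the shortest invalid title


def build_job_event(job_type, params):
    """Enqueue side: carry the params, hardcode the page fields the schema requires."""
    return {
        "type": job_type,
        "page_namespace": STUB_NAMESPACE,
        "page_title": STUB_TITLE,
        "params": dict(params),
    }


def read_job_event(event):
    """Dequeue side: rebuild the job from type and params only; ignore page fields."""
    return event["type"], dict(event["params"])


def strip_prefix(page_title):
    """The workaround that was floated and dropped: cut everything before the first ':'."""
    return page_title.split(":", 1)[1] if ":" in page_title else page_title


if __name__ == "__main__":
    event = build_job_event("cdnPurge", {"urls": ["https://example.org/wiki/Foo"]})
    print(read_job_event(event))
    # A prefixed title is handled as hoped ...
    print(strip_prefix("User:Example"))
    # ... but a main-namespace title that merely contains a colon gets mangled.
    print(strip_prefix("2001:_A_Space_Odyssey"))
```

The last call shows why guessing from the string alone was rejected: a prefixed title and a plain title that contains a colon cannot be told apart at run time, so the page fields are stubbed and ignored instead.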
[18:06:53] 10Operations, 10PHP 7.2 support, 10Wikimedia-production-error: Fatal exception of type "ConfigException" - https://phabricator.wikimedia.org/T221347 (10Bawolff) Hmm, L and K having a hamming distance of 3 - Could this possibly be a memory error that wasn't detectable by ECC as a 3 bit error? [18:12:35] (03PS3) 10Dzahn: site/conftool: assign mw2150 jobrunner, mw2244,mw2245 API servers [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) [18:13:23] (03CR) 10Dzahn: "mw2151 was just reactivated as a jobrunner, so together this means 2 for each role" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [18:14:11] mobrovac: wow, it's making an article url as well? [18:14:13] PROBLEM - Long running screen/tmux on netmon1002 is CRITICAL: CRIT: Long running tmux process. (user: crusnov PID: 10611, 1730096s 1728000s). [18:14:17] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504933/2/includes/JobExecutor.php [18:14:27] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504933/2/includes/EventFactory.php [18:16:34] Krinkle: yeah, that's the non-JQ EB part, we need it for every event to have a unique URI, but for JQ is likely not needed [18:16:40] so yes, i agree we need to think about this [18:16:51] especially when the title and ns are going away [18:16:56] mobrovac: unique? it is right now very not unique. [18:17:15] Krinkle: sorry, not unique, but identifiable [18:17:21] 10Operations, 10DNS, 10Mail, 10Traffic: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10herron) Indeed I'm able to produce a DKIM issue as well with wiki-mail. Here's an example (seen in headers of message triggered by account preferences change): ` dkim=invalid (public key: granula... [18:17:27] Krinkle: it's what we use for updating caches/purging etc [18:17:29] mobrovac: what is it using that string for? 
[18:17:34] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:18:04] mobrovac: right, for event consumed by restbase etc. [18:18:06] changeprop, makes sense. [18:18:24] yup exactly [18:18:33] so we'll likely need to separate those [18:18:38] but are they listening to "*" any event? including jobs? [18:19:08] they should not do anything for a job with title=Main_Page for example, because at that point, nothing will have happened yet. and maybe notihng will (e.g. if I'm adding it to my watch list) [18:19:09] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) mw2150 was not in this ticket until now, but it was in site.pp as another spare under the "Former imagescalers" section. added to ticket. checking if it has been reinstalled. [18:19:19] the real thing will have its own evnet, but I suppose it's not harmful, just wasteful :) [18:19:33] Pchelolo: are we actually using the article url for jobs for anything? dedup perhaps? cf. https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504933/2/includes/EventFactory.php@980 [18:19:51] mobrovac: I'm writing a comment on gerrit about that [18:19:52] Krinkle: no no, these go into different topics, so no action is taken by CP for jobs [18:19:59] cool [18:21:00] mobrovac: note, train is still blocked until core patch is done as well. I suspect right now it just means jobs are running without errors but not doing what they're supposed to. [18:21:31] Krinkle: ok is somebody working on a patch for core? [18:21:43] I did already, awaiting A.aron's review. 
[18:22:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Andrew) [18:27:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Andrew) a:03Cmjohnson Assigning back to Chris for remove old switch port info/update netbox with new name and location/update switch and physical l... [18:27:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Andrew) a:03Cmjohnson Assigning back to Chris for remove old switch port info/update netbox with new name and location/update switch and physical l... [18:27:21] win 18 [18:27:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Andrew) a:03Cmjohnson Assigning back to Chris for remove old switch port info/update netbox with new name and location/update switch and physical l... [18:27:39] (03PS1) 10Herron: phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) [18:27:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Andrew) a:03Cmjohnson Assigning back to Chris for remove old switch port info/update netbox with new name and location/update switch and physical l... [18:28:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Andrew) a:03Cmjohnson Apart from the spare raid drives, this looks good. I think we should just forge ahead. Assigning back to Chris for remove o... 
[18:28:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Andrew) a:03Cmjohnson Assigning back to Chris for remove old switch port info/update netbox with new name and location/update switch and physical l... [18:28:54] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) p:05Normal→03High [18:29:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [18:37:15] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10RobH) a:05RobH→03jijiki So this would be ideal to reimage as a thumbnor server to replace thumbor1004 via T221132. @effie: would this work for thumbor replacemen... [18:41:15] !log mw2150 - reimaging, not in confctl [18:41:16] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw2150.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201904181840_dzahn... [18:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:23] (03Abandoned) 10Urbanecm: Enable subpages by default for a few more extra namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504018 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [18:43:32] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10RobH) I'll be using the new cookbook documented on https://wikitech.wikimedia.org/wiki/Decom_scr... 
[18:44:55] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:45:55] (03PS1) 10Urbanecm: Add wikimania years namespaces to wgNamespacesWithSubpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) [18:46:28] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:46:35] 10Operations, 10PHP 7.2 support, 10Wikimedia-production-error: Fatal exception of type "ConfigException" - https://phabricator.wikimedia.org/T221347 (10Daimona) Given that they're close on the keyboard, I originally thought of a typo. But that would mean someone edited the file and introduced the typo on mw1... [18:47:26] Anyone able to do a very-last-minute SWAT deploy? https://gerrit.wikimedia.org/r/504939 [18:47:26] , please :) [18:47:28] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:40] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:46] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labs... 
[18:47:46] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:56] mobrovac: after https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504933/ and confirming no issues in logstash for group0, I can we can roll forward. [18:48:03] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labs... [18:48:16] (03CR) 10Jforrester: Add wikimania years namespaces to wgNamespacesWithSubpages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [18:48:39] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10RobH) [18:49:27] (03PS2) 10Urbanecm: Add wikimania years namespaces to wgNamespacesWithSubpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) [18:51:24] kk Krinkle, sounds good, let's wait for the core patch to go through and reconvene [18:51:46] (03CR) 10Urbanecm: Add wikimania years namespaces to wgNamespacesWithSubpages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [18:51:51] mobrovac: I don't think it will, and it's not needed. Whereas the EB patch would make things, better, but hold on. Just saw something. [18:52:33] (03CR) 10Jforrester: [C: 03+1] "Fair. LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [18:55:31] herron: puppet is failing on udp2log-test1.logging.eqiad.wmflabs — can you fix or delete the VM? (And, in theory you've been getting emails alerting you about the breakage… is that happening?) [18:56:08] good catch Krinkle [18:57:01] :) [18:57:48] (03PS1) 10Dzahn: phabricator: remove phab1002 as a phab failover server [puppet] - 10https://gerrit.wikimedia.org/r/504940 (https://phabricator.wikimedia.org/T221391) [18:58:19] Krinkle: ok, once jenkins is happy we can merge and backport that as well [18:58:53] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:59:00] (03PS1) 10RobH: decommission labsdb100[45] [puppet] - 10https://gerrit.wikimedia.org/r/504941 (https://phabricator.wikimedia.org/T216749) [18:59:06] Krinkle: euh no, wait, what's the status of core? wasn't aaron's patch that changes the logic reverted? [18:59:09] (03CR) 10Dzahn: [C: 04-1] "thanks. i am stalling this to see whether mw1298 is being taken as a thumbor server instead or not. https://phabricator.wikimedia.org/T21" [puppet] - 10https://gerrit.wikimedia.org/r/504791 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [18:59:21] mobrovac: Nothing's happened in core yet. [18:59:21] mobrovac: no, nothing was reverted. [18:59:26] jouncebot: next [18:59:26] In 0 hour(s) and 0 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1900) [18:59:31] ooh ok [18:59:35] and my patch is actually making things worse in core haha, it's at least no longer needed for this to be unbroken. 
[18:59:38] (03CR) 10RobH: [C: 03+2] decommission labsdb100[45] [puppet] - 10https://gerrit.wikimedia.org/r/504941 (https://phabricator.wikimedia.org/T216749) (owner: 10RobH) [18:59:52] twentyafterfour: Things still being worked on ahead of back-port to wmf.1 to fix the train. :-( [19:00:04] twentyafterfour: How many deployers does it take to do MediaWiki train - Americas version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T1900). [19:00:38] James_F: ack, I am following [19:00:48] (03PS1) 10RobH: decommission labsdb100[45] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/504944 (https://phabricator.wikimedia.org/T216749) [19:01:16] (03CR) 10RobH: [C: 03+2] decommission labsdb100[45] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/504944 (https://phabricator.wikimedia.org/T216749) (owner: 10RobH) [19:01:18] * mobrovac will deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504942/ once we get V+2 from jenkins [19:01:32] (03CR) 10Dzahn: [C: 03+2] "@twentyafterfour this is because we are now getting the 64GB RAM server instead :)" [puppet] - 10https://gerrit.wikimedia.org/r/504940 (https://phabricator.wikimedia.org/T221391) (owner: 10Dzahn) [19:02:23] (03CR) 10Dzahn: [C: 03+2] "fyi Tyler as well.. 
follow-up to that phab upgrade meeting we had a while ago" [puppet] - 10https://gerrit.wikimedia.org/r/504940 (https://phabricator.wikimedia.org/T221391) (owner: 10Dzahn) [19:02:37] (03PS2) 10Dzahn: phabricator: remove phab1002 as a phab failover server [puppet] - 10https://gerrit.wikimedia.org/r/504940 (https://phabricator.wikimedia.org/T221391) [19:03:15] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10RobH) [19:03:20] (03PS1) 10Andrew Bogott: Remove reference to in apparmor template [puppet] - 10https://gerrit.wikimedia.org/r/504946 [19:03:36] (03PS1) 10Bstorm: wmcs puppetdb: fix puppetdb module for vms [puppet] - 10https://gerrit.wikimedia.org/r/504947 [19:04:03] (03CR) 10jerkins-bot: [V: 04-1] wmcs puppetdb: fix puppetdb module for vms [puppet] - 10https://gerrit.wikimedia.org/r/504947 (owner: 10Bstorm) [19:04:21] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10RobH) a:05RobH→03Cmjohnson These are very old R510s, so they are slated for decommission and dispos... [19:04:31] robh: multi puppet merge. 
i can take yours too [19:05:22] (03PS1) 10Cwhite: remove granularity key from wiki-mail DKIM [dns] - 10https://gerrit.wikimedia.org/r/504948 (https://phabricator.wikimedia.org/T221290) [19:05:51] (03PS2) 10Bstorm: wmcs puppetdb: fix puppetdb module for vms [puppet] - 10https://gerrit.wikimedia.org/r/504947 [19:06:04] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10RobH) [19:06:15] * mobrovac deploying [19:06:39] (03CR) 10jerkins-bot: [V: 04-1] wmcs puppetdb: fix puppetdb module for vms [puppet] - 10https://gerrit.wikimedia.org/r/504947 (owner: 10Bstorm) [19:07:26] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10colewhite) p:05Triage→03High a:03colewhite [19:08:02] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10colewhite) p:05Triage→03Normal [19:08:04] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10faidon) How did it work until now? Also, unrelatedly, we probably should use something stronger than a 1024-bit RSA key. 
[19:10:04] !log mobrovac@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/EventBus/includes/JobExecutor.php: Remove the use of page titles in JobExecutor, file 1/2 - T221368 (duration: 01m 01s) [19:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:39] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [19:11:37] (03CR) 1020after4: [C: 03+1] "thanks for the heads up" [puppet] - 10https://gerrit.wikimedia.org/r/504940 (https://phabricator.wikimedia.org/T221391) (owner: 10Dzahn) [19:11:46] !log mobrovac@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/EventBus/includes/EventFactory.php: Remove the use of page titles in JobExecutor, file 2/2 - T221368 (duration: 00m 59s) [19:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:09] ok {{done}} Krinkle, twentyafterfour, James_F [19:12:15] Thanks mobrovac. [19:12:47] So theoretically we shouldn't see any new events in https://logstash.wikimedia.org/app/kibana#/dashboard/5587ec70-d421-11e7-a2bf-bb774cde766e?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3Anow-24h%2Cmode%3Aquick%2Cto%3Anow)) and we can roll the train forward to wmf.1 and wmf.2? [19:12:59] (03PS2) 10Andrew Bogott: Fix reference to in apparmor template [puppet] - 10https://gerrit.wikimedia.org/r/504946 [19:13:43] Lots of "broker transport failure" events. 
[19:14:00] (03PS14) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [19:14:22] 10Operations, 10serviceops: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn) [19:14:25] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002 - https://phabricator.wikimedia.org/T195623 (10Dzahn) [19:14:28] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10Dzahn) [19:14:32] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) [19:16:22] hmm this is no bueno [19:16:43] i'll restart cpjobqueue [19:17:49] !log mobrovac@deploy1001 Started restart [cpjobqueue/deploy@922cbc0]: Bounce CP4JQ, lots of transport broken failures - T221368 [19:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:53] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [19:18:31] I'm testing in mediawiki.org [19:18:44] but seems like something that should produce a job isn't running in at least a few minutes, possibly lost indeed [19:19:04] (03PS1) 10Dzahn: install_server: remove phab1002, add phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/504951 (https://phabricator.wikimedia.org/T221389) [19:19:47] (03CR) 10Dzahn: [C: 03+2] install_server: remove phab1002, add phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/504951 (https://phabricator.wikimedia.org/T221389) (owner: 10Dzahn) [19:19:54] (03PS2) 10Dzahn: install_server: remove phab1002, add phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/504951 
(https://phabricator.wikimedia.org/T221389) [19:20:33] FormatJson::encode($events) failed: Malformed UTF-8 characters, possibly incorrectly encoded. Aborting send. [19:20:38] Krinkle: hm, https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.04.18/mediawiki?id=AWox47BHm4XPTDeIY73c [19:20:42] Yeah, those seem long-standing. [19:20:48] twentyafterfour: yes, that's known (cc ottomata) [19:20:50] The "Failed creating job from description" errors have stopped. [19:20:59] https://logstash.wikimedia.org/app/kibana#/dashboard/5587ec70-d421-11e7-a2bf-bb774cde766e?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3Anow-24h%2Cmode%3Aquick%2Cto%3Anow)) [19:21:17] Last one at 17:44:43 UTC. [19:21:26] TranslateRenderJob: Cannot render translation page for Special:! [19:21:57] (03Abandoned) 10Bstorm: wmcs puppetdb: fix puppetdb module for vms [puppet] - 10https://gerrit.wikimedia.org/r/504947 (owner: 10Bstorm) [19:22:28] a looot of categoryMembershipChange failures [19:22:43] Yes. [19:22:54] "Could not acquire lock 'CategoryMembershipUpdates:18670'". [19:23:05] ah hm [19:23:11] https://gerrit.wikimedia.org/g/mediawiki/extensions/Translate/+/8aa4e78e4c91316a69bf988ebf8cc271444c6d50/tag/TranslateRenderJob.php#34 [19:23:12] iirc we had that problem before [19:23:19] Oh my, that's very... odd. [19:23:33] It's assigning ->params again, which means core's upgrade is being trashed. [19:23:41] Lovely. [19:23:52] Let me see if we can support that. [19:24:10] should we stop job execution for now? [19:24:48] a lot of deadlock error now [19:24:51] Are we losing the jobs or are they getting re-pooled for trying later? [19:25:06] sure, although I don't think we can fix them with a new executor logic. the queueing is broken, as I expected. It's losing the information at queue time. [19:25:15] But yes, less noise and load is better :) [19:25:19] they get retried [19:25:25] well, not forever. [19:25:29] Right. 
[19:25:32] ok stopping it for now [19:25:36] and the info it needs to succeed is lost. [19:25:40] until we regroup [19:25:50] could rollback further to testwiki, depending on whether we care about mw.org [19:25:57] looking into the title override now [19:26:09] So the only way those jobs are ever going to work is if we patch EventBus to work it out anyway? [19:26:26] Or is it definitely too late and we should just accept that they'll fail? [19:26:37] the target page for those jobs was lost at the source. it doesn't know what the event is about, it got overwritten with NS:'' [19:26:51] at serialisation time, during the web request [19:27:11] (03PS1) 10Dzahn: partman/netboot: remove phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/504957 (https://phabricator.wikimedia.org/T221391) [19:27:17] OK, so there's nothing we can do for those jobs. [19:27:46] Can we fix Translate to emit no more broken jobs? Can we fix MW to cope with broken jobs? Which is fastest to fix? [19:29:30] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10herron) >>! In T221290#5123622, @faidon wrote: > How did it work until now? I wonder the same thing. Looking through old personal emails I have a message from wiki@wikimedia... [19:29:31] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'scb*' 'disable-puppet "mobrovac: temp stop JQ for T221368" && systemctl stop cpjobqueue' [19:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:36] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [19:29:38] (03CR) 10Dzahn: [C: 03+2] partman/netboot: remove phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/504957 (https://phabricator.wikimedia.org/T221391) (owner: 10Dzahn) [19:29:42] I've grepped with codesearch and indeed limited to Translate [19:29:44] will fix [19:29:58] meanwhile, would be good to figure out if other stuff is working. 
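Note: the kind of bug described above ("assigning ->params again") can be illustrated generically. The sketch below is not Translate or MediaWiki code. It assumes, purely for illustration, a base class that folds the target page into the job's params, and shows how a subclass that reassigns params afterwards drops that information before the job is serialised. All class and field names are hypothetical.

```python
# Generic illustration only; class and field names are hypothetical.

class BaseJob:
    def __init__(self, command, namespace, title, params=None):
        self.command = command
        # Assumed behaviour: the base class stores the target page inside params.
        self.params = dict(params or {})
        self.params["namespace"] = namespace
        self.params["title"] = title


class OverwritingJob(BaseJob):
    def __init__(self, namespace, title, params=None):
        super().__init__("render", namespace, title, params)
        # Reassigning params discards what the base class just added.
        self.params = dict(params or {})


class PreservingJob(BaseJob):
    def __init__(self, namespace, title, params=None):
        # Leaves the params built by the base class untouched.
        super().__init__("render", namespace, title, params)


def serialize(job):
    """What would be written to the queue at enqueue time."""
    return {"type": job.command, "params": dict(job.params)}


if __name__ == "__main__":
    print(serialize(OverwritingJob(0, "Main_Page", {"lang": "fi"})))
    print(serialize(PreservingJob(0, "Main_Page", {"lang": "fi"})))
```

Under these assumptions the target page is already gone when the job is serialised, which is consistent with the point made above that changes on the executor side cannot recover jobs that were queued this way.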
[19:30:11] with the queue stopped that will be hard I guess. what else was failing? [19:30:29] we had a lot of db deadlocks [19:30:35] for categorymembership [19:31:40] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:31:42] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:31:50] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:32:00] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:32:03] known ^^^ [19:32:04] mobrovac: going to ack these too [19:32:06] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:32:10] lemme silence these [19:32:30] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:32:48] OK, stock take. [19:33:24] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:33:32] patch up for review [19:33:46] hm i can't log in to icinga anymore [19:33:46] hm [19:33:50] looking at categorymembership locking now, seems unrelated at glance, but looking [19:33:51] ok will deal with that later [19:33:59] kk thnx Krinkle [19:34:19] mobrovac: it might be the recent ldap case-sensitivity change (?) [19:34:20] (03PS1) 10Dzahn: site: turn phab1002 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/504959 (https://phabricator.wikimedia.org/T221391) [19:34:24] There was a change in MW. This broke EventBus (because it was inefficiently using the title in some cirumstances), and in job creators (at least Translate, because it was expecting internal behaviour of MW). 
We've got two backported fixes in EventBus to work with the new MW (is that back-compat. with wmf.25?). [19:34:24] but anyway, I'm silencing them now [19:34:28] mobrovac: LDAP/wikitech user, but i gess it's Mobrovac vs mobrova [19:34:35] recent changes about that [19:34:38] aaah [19:35:16] the good part is before you had "i can login but still no privileges to run commands" and with the right spelling you had both :p [19:35:22] yup cdanis, mutante, that was it :P [19:36:06] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime [19:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:09] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:19] I wish that logged the args heh [19:36:25] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cookbook sre.hosts.downtime -r "mobrovac: temp stop JQ for T221368" 'scb*' [19:36:25] dowtime > disabling notifications :) [19:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:34] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [19:36:36] yea, true about the args [19:36:47] cdanis: ah thnx! [19:36:54] np, much faster this way [19:37:00] cdanis: much easier than clicking like crazy in icinga [19:37:08] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [19:37:23] related? ^ [19:37:26] i especially hate it when i click all the things and then icinga reload the page before i can click sumbit [19:37:37] mobrovac: where did you see the categoryMembershipChange failures? 
[19:37:42] icinga is almost as user friendly as gerrit [19:37:42] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [19:37:44] twentyafterfour: yes, likely, we are not running jobs that update [19:37:45] not seeing them at https://logstash.wikimedia.org/goto/658c5efbe8547281bad01aa5f4255085 [19:38:06] twentyafterfour: the graphs certainly look like that is related [19:38:22] Krinkle: job_type:categoryMembershipChange [19:38:23] mobrovac: use the search box so only the stuff you want shows up. and then you could use the single checkbox to select all [19:38:42] Krinkle: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.04.18/mediawiki?id=AWox66XIm4XPTDeIZX6H is just one example [19:38:49] but cookbook script is much better anyways [19:38:49] Krinkle: all of the failures seem to be for templates [19:38:55] so the name might be the problem .... [19:39:03] mobrovac: that's not on wmf.1 [19:39:07] zhwikivoyage [19:39:16] ah indeed Krinkle! [19:39:17] heh [19:39:19] Yeah, unrelated issue. [19:39:20] duh [19:39:33] The jobs could have been generated when it was on wmf.1? [19:39:37] and yeah, that job has locking issues for a while, it's a known issue [19:39:39] though deadlocks could be caused by code on wmf.1 and show up on wmf.25 nonetheless [19:39:46] mutante: yeah i do that, but when there are many checkboxes to click i'm not as fast as icinga is at refreshing :P [19:39:54] How long does the job queue take for that kind of job to get to the front of the queue to process? [19:40:31] https://logstash.wikimedia.org/goto/85a9f458fac14324117ff283255c7837 job_type:categoryMembershipChange past 7 days [19:40:32] not new :) [19:40:45] OK, let's ignore that one then. 
[19:40:46] cdanis: if you pass -t/--task-id to the downtime cookbook it now updates the task too ;) [19:40:55] James_F: less than a sec on avg [19:41:07] volans: oh neat. i still wish it logged arguments in the SAL [19:41:16] Hmm, in that case, never mind. [19:41:25] now, "Malformed UTF-8 characters, possibly incorrectly encoded. Aborting" [19:41:33] that, unlike what I thought, is indeed new. [19:41:41] Krinkle: i think we are good to merge your patch for translatejob [19:41:44] https://logstash.wikimedia.org/goto/d78e83dbb1985ea99d78106d62ba44db [19:41:45] new-ish anyway [19:41:46] New from wmf.1? New from our fixes for wmf.1? [19:41:49] Krinkle: that's unrelated, it's monolog stuff [19:41:52] many many times higher now than before 2-3 days ago [19:41:58] mobrovac: the top-most checkbox should select everything at once [19:42:22] mobrovac: ok. [19:42:22] cdanis: I know we'll get there, there is a task already [19:42:29] volans: <3 [19:42:31] mobrovac: yes, roll that out [19:42:41] (03PS2) 10Smalyshev: mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [19:42:50] (03CR) 10Smalyshev: [C: 03+1] mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [19:42:51] Krinkle: ah still waiting on V+2 [19:43:16] lol James_F [19:43:16] mobrovac: I have faith that CI won't merge it without the V+2. [19:43:26] (03CR) 10Dzahn: [C: 03+2] site: turn phab1002 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/504959 (https://phabricator.wikimedia.org/T221391) (owner: 10Dzahn) [19:44:07] Krinkle: One way to get it tested. :-) [19:44:49] My favourite way to get priority in CI is to go into jenkins and manually abort the jobs ahead of you in the queue. Definitely not disruptive for other people. ;-) [19:45:41] (03CR) 10Herron: [C: 03+1] "+1 for this as a near-term fix. 
Considering coverage as we approach the easter holiday I think best to merge next week (say tues?)" [dns] - 10https://gerrit.wikimedia.org/r/504948 (https://phabricator.wikimedia.org/T221290) (owner: 10Cwhite) [19:45:47] (03PS1) 10BryanDavis: toolsdb: fix IP address in eqiad1 region pdns [puppet] - 10https://gerrit.wikimedia.org/r/504962 [19:46:41] (03Abandoned) 10Cwhite: httpd: featurize mod-security for use with httpd [puppet] - 10https://gerrit.wikimedia.org/r/497930 (https://phabricator.wikimedia.org/T218784) (owner: 10Cwhite) [19:47:18] Hmm, it looks like Translate has the exact same SWAT and non-SWAT CI jobs. [19:48:20] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10faidon) It's been a while but if I recall correctly, the intention was to not allow (= not create a valid signature) emails that had e.g. From: person@wikipedia.org (where per... [19:48:55] (03CR) 10Andrew Bogott: [C: 03+2] toolsdb: fix IP address in eqiad1 region pdns [puppet] - 10https://gerrit.wikimedia.org/r/504962 (owner: 10BryanDavis) [19:51:17] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2150.codfw.wmnet'] ` and were **ALL** successful. [19:52:00] apergos: is dumps-1.dumps.eqiad.wmflabs yours? [19:52:24] dpkg is hanging there which is breaking puppet runs. 
I've been poking but can't get it unstuck [19:55:45] (03PS1) 10Dzahn: mariadb: replace phab1002 grant comments with phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/504964 (https://phabricator.wikimedia.org/T221389) [19:57:26] (03CR) 10Andrew Bogott: [C: 03+2] Fix reference to in apparmor template [puppet] - 10https://gerrit.wikimedia.org/r/504946 (owner: 10Andrew Bogott) [19:57:34] (03PS3) 10Andrew Bogott: Fix reference to in apparmor template [puppet] - 10https://gerrit.wikimedia.org/r/504946 [19:59:32] (03PS1) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS again [puppet] - 10https://gerrit.wikimedia.org/r/504968 [20:00:59] (03CR) 10Cwhite: [C: 03+1] phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) (owner: 10Herron) [20:01:10] James_F: my way fails when the swat one gets randomly picked for failing in a flaky test, when the master branch one chucks along fine [20:01:14] which just happened [20:01:34] (03PS1) 10Andrew Bogott: apparmor: @qualify pidfile variable in apparmor config [puppet] - 10https://gerrit.wikimedia.org/r/504969 [20:01:34] 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn) @Robh Ideally i would like to use the same IP i had used for phab1002, so waiting for that decom task to be past "remove production dns". [20:01:46] (03CR) 10Andrew Bogott: [C: 03+2] apparmor: @qualify pidfile variable in apparmor config [puppet] - 10https://gerrit.wikimedia.org/r/504969 (owner: 10Andrew Bogott) [20:01:58] Krinkle: Yeah, sigh. [20:02:55] (03CR) 10Dzahn: [C: 03+1] "this is a good reminder that soon we have to update these when phab moves from phab1001 to phab1003 temporarily. 
note to self: grep for IP" [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) (owner: 10Herron) [20:03:56] (03PS1) 10Andrew Bogott: apparmor: @qualify socket variable in apparmor config [puppet] - 10https://gerrit.wikimedia.org/r/504970 [20:04:27] mutante: thanks! very good point [20:04:31] (03CR) 10Andrew Bogott: [C: 03+2] apparmor: @qualify socket variable in apparmor config [puppet] - 10https://gerrit.wikimedia.org/r/504970 (owner: 10Andrew Bogott) [20:05:30] herron: :) for me as well. i was in the middle of phab1003 to let us upgrade prod server from jessie finally [20:05:49] ha! good timing [20:06:09] dont know if that could also be a service IP that stays [20:06:12] (03CR) 10Herron: "I'm thinking we should wait to merge this until after the holiday weekend, just in case" [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) (owner: 10Herron) [20:06:35] andrewbogott: nope [20:06:49] I have nly the snapshot01 and dumps-puppetmaster in deployment-prep [20:07:05] some archive team people set up a dumps project in labs [20:07:09] er cloud [20:07:15] back in the day and stole the name [20:07:29] apergos: ok, I'll figure out who to poke [20:07:37] Hydriz maybe? [20:07:42] mutante: yeah something to think about, would be nice to avoid having to keep it manually in sync [20:08:41] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/15897/puppetdb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [20:09:17] 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn) [20:09:37] herron: ack. i just made a check box on a ticket to remind me when it is time.. 
but definely [20:09:49] (03PS1) 10Hashar: gerrit: reduce sshd.idleTimeout to one hour [puppet] - 10https://gerrit.wikimedia.org/r/504972 (https://phabricator.wikimedia.org/T182756) [20:09:51] (03PS1) 10Hashar: gerrit: reduce sshd.MaxConnectionsPerUser 32 -> 4 [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) [20:10:01] lunch & [20:11:38] vamos jenkins, vamos [20:12:09] leeeeroy [20:15:47] jbond42: you have three VMs with broken puppet: jbond-buster.puppet.eqiad.wmflabs, jbond-jessie.puppet.eqiad.wmflabs, jbond-puppet-client.puppet.eqiad.wmflabs. Can you delete or fix them as appropriate? [20:16:48] (03PS1) 10RobH: setting backup1001 to spare for now [puppet] - 10https://gerrit.wikimedia.org/r/504976 (https://phabricator.wikimedia.org/T196478) [20:17:32] (03CR) 10RobH: [C: 03+2] setting backup1001 to spare for now [puppet] - 10https://gerrit.wikimedia.org/r/504976 (https://phabricator.wikimedia.org/T196478) (owner: 10RobH) [20:19:09] (03CR) 10Ottomata: "It'll be TLS but that will be handled by the usual external nginx tls + varnish + lvs stuff." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [20:20:43] (03PS11) 10Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) [20:21:51] mobrovac: it merged, if you haven't fallen asleep yet… [20:22:06] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10Krenair) >>! In T221290#5123696, @herron wrote: >>>! In T221290#5123622, @faidon wrote: >> How did it work until now? > > I wonder the same thing. Looking through old person... [20:22:35] Krinkle: hehe i'm in SF, so it's lunch time for me :P [20:22:39] Krinkle: ok will deploy [20:23:03] Krinkle: wait what merged? 
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Translate/+/504961/ still isn't there [20:23:12] touché [20:23:23] was looking at the master commit [20:23:25] which started later [20:23:31] but.. finished earlier [20:23:33] because of the flaky failure [20:23:36] yeah i know, i got false hope too [20:24:02] We really should have CI re-try jobs that fail when the rest pass. [20:26:46] (03PS2) 10Hashar: gerrit: reduce sshd.MaxConnectionsPerUser 32 -> 4 [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) [20:26:52] James_F: Krinkle: i've started wondering whether we should send an email to wikitech-l and ops-l given that it's already been 1h that the jobqueue is not running [20:27:07] mobrovac: for all wikis? [20:27:12] I think we can start it up again [20:27:18] yeah no jobs are running [20:27:19] the failures from Translate shouldn't be of any consequence [20:27:32] I didn't realise it was for all wikis, yeah, just start it [20:27:38] it'll continue where it left off, right? [20:27:41] i'd rather have that fix out so that we can look at the clean situation [20:27:53] the broken jobs will still spew errors from before [20:27:55] yup, [20:27:59] (03CR) 10Hashar: "We are not sure whether `stream-events` is exempt :-/ Then Zuul should just reconnect on the fly so it is probably safe." [puppet] - 10https://gerrit.wikimedia.org/r/504972 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [20:28:17] and during its offline time, we still queued new broken jobs just the same. [20:28:22] But deploying the Translate fix will at least stop new broken jobs being generated. [20:28:36] Yes, for sure, but we can start jq asap. [20:28:43] Fair. Yeah, let's do that. [20:28:44] k doing it [20:28:45] I thought it was just for group0 or mediawikiwiki, didn't realise. 
[20:28:59] (03PS1) 10Effie Mouzeli: thumbor: Inlcude mtail in ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/504978 (https://phabricator.wikimedia.org/T220499) [20:29:04] mobrovac: shall I do the needful then? [20:29:05] Krinkle: We don't have hetdeploy for services. [20:29:12] cdanis: That'd be great. [20:29:41] cdanis: hold on, i will start the service and then you can re-enable puppet via cumin? sounds good? [20:29:51] sure, I can do either or both via cumin as well [20:29:54] James_F: yeah, the old queue handler was also a non-mw service, but did have per-wiki queues, which we lost. [20:30:05] Oh, interesting, it did? Huh. [20:30:05] (the php jobchron service) [20:30:29] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@71941b1]: Ignore Kafka disconnect errors [20:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:19] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [20:31:20] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@71941b1]: Ignore Kafka disconnect errors (duration: 00m 51s) [20:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:35] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [20:31:37] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [20:31:39] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [20:31:44] yeah, so if enwiki was doing a lot of small jobs, e.g. due to a lot of edits or account creations, and a small wiki is doing some edits, they still get their jobs executed fairly quickly. Overall, this isn't a super common issue given our overall edit rate isn't in the millions per minute. And any recursive actions (like template) we do in a single self-re-enqueqing job. 
So overall jobs still get processed pretty quickly, which is why I [20:31:44] guess we dropped that in the switch to Kafka. [20:31:52] we do still have a fragmentation by job type. [20:31:57] ok cdanis, you can re-enable puppet now, and icinga is already realising all is back [20:32:03] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [20:32:03] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [20:32:15] so I guess in retrospect, it would've been possible maybe to disable processing of Translate jobs in particular. [20:32:22] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'scb*' 'enable-puppet "mobrovac: temp stop JQ for T221368"' [20:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:27] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [20:33:24] At the time we saw multiple different kinds of job failures (the UTF-8, the category, and the Translate ones). [20:33:48] Retrospectively we now know that we only care about the last of those three, but… [20:33:48] yeah for the utf8 one, I'll swat the fix later in the afternoon [20:33:59] Cool. [20:34:02] (03CR) 10Effie Mouzeli: [V: 03+1] "LGMT https://puppet-compiler.wmflabs.org/compiler1002/15898/thumbor1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504978 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [20:34:25] Translate wmf.1 change is landed. [20:34:29] (Finally.) [20:34:42] 45mins later [20:34:45] cool cool [20:34:46] Yeah. :-( [20:34:46] (03PS1) 10CDanis: conftool: remove 2.7 metadata, add 3.7 [software/conftool] - 10https://gerrit.wikimedia.org/r/504980 [20:34:53] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [20:34:54] ok i'll deploy it [20:35:03] Krinkle: James_F: twentyafterfour: ^ [20:35:11] Thanks. 
[20:36:12] So we can abandon https://gerrit.wikimedia.org/r/c/mediawiki/core/+/504885 etc.? [20:37:36] Yeah. there may be a clean up patch in core later, I'll coordinate with Aaron on that. [20:37:43] Cool. [20:38:45] (03CR) 10Paladox: [C: 03+1] "We can always raise this a bit if needed but best to start low and work up to a acceptable number (if one hour is too low) :)" [puppet] - 10https://gerrit.wikimedia.org/r/504972 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [20:38:47] !log mobrovac@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/Translate/tag: Translate jobs: Remove problematic Job::$params assignments, dir 1/2 - T221368 (duration: 01m 01s) [20:39:38] (03CR) 10Paladox: [C: 03+1] gerrit: reduce sshd.MaxConnectionsPerUser 32 -> 4 [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [20:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10RobH) [20:40:14] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [20:40:31] !log mobrovac@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/Translate/utils/MessageUpdateJob.php: Translate jobs: Remove problematic Job::$params assignments, dir 2/2 - T221368 (duration: 01m 00s) [20:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:47] ok deploy done [20:40:52] (03CR) 10Effie Mouzeli: [C: 03+1] redis::instance: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/500415 (owner: 10Muehlenhoff) [20:40:56] :D [20:41:37] OK, so theoretically we can roll the train? [20:42:09] yes [20:43:28] Let's do a little bit of testing on mediawiki.org to verify jobs are running fine. 
[20:44:33] confirmed, translate and purge and recursive refresh links work fine [20:44:57] (03PS1) 10BryanDavis: gerrit: update 'accountPattern' for LDAP account locking [puppet] - 10https://gerrit.wikimedia.org/r/504981 (https://phabricator.wikimedia.org/T218654) [20:46:07] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:46:23] (03CR) 10Paladox: [C: 03+1] gerrit: update 'accountPattern' for LDAP account locking [puppet] - 10https://gerrit.wikimedia.org/r/504981 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [20:46:43] OK. twentyafterfour, over to you. :-) [20:47:00] (03CR) 10Gehel: [C: 03+1] "LGTM, will merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/503709 (owner: 10Alex Monk) [20:47:17] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:49:08] (03PS1) 10RobH: db1139 and db1140 production dns [dns] - 10https://gerrit.wikimedia.org/r/504982 (https://phabricator.wikimedia.org/T218985) [20:49:21] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10RobH) [20:50:07] hm, downtiming all of scb* was a bad idea in retrospect. also disables service alerts for all the other scb services. 
I'm working on un-downtiming now [20:50:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10RobH) [20:50:35] (03CR) 10CRusnov: coherence report: General improvements and rack checks (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [20:50:37] (03CR) 10RobH: [C: 03+2] db1139 and db1140 production dns [dns] - 10https://gerrit.wikimedia.org/r/504982 (https://phabricator.wikimedia.org/T218985) (owner: 10RobH) [20:50:38] James_F: :P [20:50:52] cdanis: ack, thanks [20:51:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10RobH) [20:52:25] !log root@icinga1001.wikimedia.org /var/lib/icinga # for DOWNTIME in $(fgrep -B12 'comment=mobrovac: temp stop JQ for T221368 - cdanis@cumin1001' retention.dat | grep -A13 servicedowntime | grep downtime_id | cut -d= -f2); do printf "[%lu] DEL_SVC_DOWNTIME;%u\n" $(date +%s) $DOWNTIME ; done > rw/icinga.cmd [20:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:30] T221368: cdnPurge and other jobs fail completely to execute - https://phabricator.wikimedia.org/T221368 [20:52:30] (03PS1) 1020after4: all wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504983 [20:52:34] (03CR) 1020after4: [C: 03+2] all wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504983 (owner: 1020after4) [20:52:48] my command reminds me of a recent xkcd [20:53:34] howso [20:53:38] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504983 (owner: 1020after4) [20:55:19] sorry chaomodus, apparently it is still *the* xkcd [20:56:24] !log twentyafterfour@deploy1001 
rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.1 refs T220726 [20:56:25] I dont think I could ever trust myself to run a command like that as root in prod, cdanis [20:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:30] T220726: 1.34.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T220726 [20:56:36] cdanis: oh dear :) [20:56:37] I looked at the output very carefully Krinkle haha [20:56:42] and validated what it did in syslog afterwards [20:56:58] Yeah, maybe after taking a while to run each step separately first [20:57:09] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504983 (owner: 1020after4) [20:57:13] impressive :) [20:57:27] it's fun to run such commands, afterwards you can always be like "ok, let's see what i broke this time" :P [20:57:39] (03CR) 10Effie Mouzeli: [C: 03+1] Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [20:58:47] (03CR) 10CDanis: [C: 03+1] thumbor: Inlcude mtail in ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/504978 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [21:00:04] 10Operations, 10ops-codfw, 10DC-Ops: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10RobH) [21:00:31] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] thumbor: Inlcude mtail in ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/504978 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [21:00:34] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10RobH) p:05Low→03Normal a:03Papaul [21:00:43] (03PS2) 10Effie Mouzeli: thumbor: Inlcude mtail in ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/504978 
(https://phabricator.wikimedia.org/T220499) [21:03:30] Well the 60 second timeout errors are way up, but that's "normal" now after a deploy [21:03:41] Krinkle: James_F: twentyafterfour: i'm going afk for lunch now, if all hell breaks loose while i'm gone send me an email, i'll see it [21:05:10] ack [21:05:35] (03CR) 10Gehel: [C: 04-1] "I think volans already addressed everything" (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [21:07:38] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10Dzahn) 05Resolved→03Open https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1068&service=EDAC+syslog+messages https://icinga.wikime... [21:08:35] ACKNOWLEDGEMENT - EDAC syslog messages on db1068 is CRITICAL: 10.03 ge 4 daniel_zahn https://phabricator.wikimedia.org/T213664#4924636 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [21:08:35] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on db1068 is CRITICAL: 10 ge 4 daniel_zahn https://phabricator.wikimedia.org/T213664#4924636 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [21:10:22] 10Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670 (10Dzahn) 05Resolved→03Open https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db2047&service=Device+not+healthy+-SMART- Service Device not healthy -SMART- On Host db2047 *cluster=mys... 
[21:11:32] 10Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670 (10Dzahn) ` @db2047:~# hpssacli ctrl all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0DB0) Port Name: 1I Port Name: 2I Gen8 ServBP 12+2 at Port 1I, Box 1, OK array... [21:12:00] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2047 is CRITICAL: cluster=mysql device=cciss,11 instance=db2047:9100 job=node site=codfw daniel_zahn https://phabricator.wikimedia.org/T149670 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2047&var-datasource=codfw+prometheus/ops [21:14:31] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [21:14:31] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [21:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:42] meh, expected. [21:15:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10RobH) a:05RobH→03Cmjohnson [21:15:41] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) [21:15:59] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) [21:17:24] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@e3c340f]: plugin update -- no restart needed (gerrit2001) [21:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:36] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@e3c340f]: plugin update -- no restart needed (gerrit2001) (duration: 00m 11s) [21:17:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:27] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@e3c340f]: plugin update -- no restart needed (cobalt) [21:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:38] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@e3c340f]: plugin update -- no restart needed (cobalt) (duration: 00m 10s) [21:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:56] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) ` Techs have completed splicing and are hands off. It may be necessary to reset your services locally at your equipment. We will now proceed to form an official RFO which we will share at a l... [21:20:32] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10herron) That's interesting, based on the headers in T221290#5123805 it looks like this issue goes back as far as 2015. >>! In T221290#5123730, @faidon wrote: > It's been a wh... [21:21:13] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) 05Open→03Resolved a:03Dzahn ` 12:01 <+icinga-wm> RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0... [21:24:25] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:25:58] jijiki: ^ [21:27:47] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[nginx] [21:28:05] !log puppetmaster1001 - mcrouter_generate_certs --generate [21:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:24] that's odd, did something happen to reshuffle nginx / haproxy port numbers on thumbor machines? [21:28:39] yes I should have disabled puppet [21:28:40] and I didnt [21:29:12] (03CR) 10Volans: [C: 03+2] flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [21:29:45] ahh ok [21:33:11] (03Merged) 10jenkins-bot: flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [21:33:20] ACKNOWLEDGEMENT - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Test haproxy metrics for thumbor traffic - T187765 - The acknowledgement expires at: 2019-04-19 22:32:41. [21:33:29] (03CR) 10Volans: [C: 03+2] prometheus: fix PytestDeprecationWarning [software/spicerack] - 10https://gerrit.wikimedia.org/r/504892 (owner: 10Volans) [21:34:01] (03CR) 10jenkins-bot: flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [21:37:26] (03PS1) 10Smalyshev: Enable revision fetches in production [puppet] - 10https://gerrit.wikimedia.org/r/504990 (https://phabricator.wikimedia.org/T217897) [21:37:45] PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:38:03] (03CR) 10Jforrester: [C: 03+1] mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [21:38:41] (03Merged) 10jenkins-bot: prometheus: fix PytestDeprecationWarning [software/spicerack] - 10https://gerrit.wikimedia.org/r/504892 (owner: 10Volans) [21:38:55] RECOVERY - HHVM rendering on mw1343 is 
OK: HTTP OK: HTTP/1.1 200 OK - 73882 bytes in 0.872 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:39:37] (03CR) 10jenkins-bot: prometheus: fix PytestDeprecationWarning [software/spicerack] - 10https://gerrit.wikimedia.org/r/504892 (owner: 10Volans) [21:46:03] (03PS1) 10Jforrester: mwgrep: Also find Gadgets-definition message [puppet] - 10https://gerrit.wikimedia.org/r/504991 [21:46:21] (03PS1) 10Smalyshev: Fix DCATAP dump loading [puppet] - 10https://gerrit.wikimedia.org/r/504992 (https://phabricator.wikimedia.org/T221405) [21:46:23] (03CR) 10Jforrester: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [21:47:09] (03PS1) 10Cwhite: grafana: update swift dashboard with legacy metric name fallback [puppet] - 10https://gerrit.wikimedia.org/r/504993 (https://phabricator.wikimedia.org/T219825) [21:47:58] (03CR) 10CDanis: [C: 03+1] grafana: update swift dashboard with legacy metric name fallback [puppet] - 10https://gerrit.wikimedia.org/r/504993 (https://phabricator.wikimedia.org/T219825) (owner: 10Cwhite) [21:51:31] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10RobH) [21:51:57] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10Dzahn) a:05Nuria→03colewhite ready to go, handing over to Cole as the weekly clinic-duty person. just needs the existing user to be added to the existing group, has appr... 
[21:55:21] !log LDAP - adding rosalie-wmde to group 'wmde' (T220691) [21:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:26] T220691: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 [21:55:53] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10Dzahn) [21:56:22] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10Dzahn) 05Open→03Resolved done [21:57:27] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Evan Prodromou - https://phabricator.wikimedia.org/T220226 (10Dzahn) 05Open→03Resolved a:03EvanProdromou ` [mwmaint1002:~] $ ldapsearch -LLLx member=uid=evanp,ou=people,dc=wikimedia,dc=org dn dn: cn=wmf,ou=groups... [22:01:03] !log remove asw2-a-eqiad license keys for troubleshooting [22:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:27] 10Operations, 10LDAP-Access-Requests: Add WDoranWMF to `wmf` LDAP group - https://phabricator.wikimedia.org/T219898 (10Dzahn) [22:09:08] (03PS1) 10Dzahn: admins: add Stella Chang and Linnea Doan to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/504997 (https://phabricator.wikimedia.org/T221118) [22:10:29] (03CR) 10Dzahn: [C: 03+2] admins: add Stella Chang and Linnea Doan to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/504997 (https://phabricator.wikimedia.org/T221118) (owner: 10Dzahn) [22:11:03] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:14:00] !log LDAP - adding 'ldoan' and 'schang' to 'wmf' (T221118) [22:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:07] T221118: add SChang & LDoan to WMF LDAP group for transparency report editing - https://phabricator.wikimedia.org/T221118 [22:14:30] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380 (10Dzahn) [22:14:33] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: add SChang & LDoan to WMF LDAP group for transparency report editing - https://phabricator.wikimedia.org/T221118 (10Dzahn) 05Open→03Resolved done! [22:14:49] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: add SChang & LDoan to WMF LDAP group for transparency report editing - https://phabricator.wikimedia.org/T221118 (10Dzahn) a:05RobH→03Dzahn [22:15:29] 10Operations, 10DC-Ops, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [22:15:45] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10RobH) [22:16:21] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:33:59] (03PS1) 10Cwhite: admin: add lucaswerkmeister to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/504998 (https://phabricator.wikimedia.org/T220084) [22:34:05] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:34:50] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10RobH) [22:34:59] twentyafterfour: has the train not gone yet? 
[22:37:36] (03CR) 10Dzahn: [C: 03+1] admin: add lucaswerkmeister to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/504998 (https://phabricator.wikimedia.org/T220084) (owner: 10Cwhite) [22:38:45] (03CR) 10Cwhite: [C: 03+2] admin: add lucaswerkmeister to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/504998 (https://phabricator.wikimedia.org/T220084) (owner: 10Cwhite) [22:39:31] mobrovac: it has [22:39:54] mobrovac: everything looks good to me in logstash [22:40:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10colewhite) [22:40:24] kk great [22:40:36] yeah looks good here too, that's why i was worried the train didn't go :P [22:40:41] thnx twentyafterfour [22:41:01] Krinkle: looks like we are out of the woods, thanks a lot for your help! [22:41:46] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10colewhite) The group membership change has been deployed. Please feel free to reopen if you encounter any related issue. 
[22:41:55] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10colewhite) 05Open→03Resolved [22:51:44] (03PS15) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [22:59:54] (03CR) 10Krinkle: Add a WMF-specific tool for managing db config in MediaWiki (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [23:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190418T2300). [23:00:04] Urbanecm and mobrovac: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:06] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [23:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:00:16] here [23:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:19] yup, here [23:00:24] who's swatting? [23:00:35] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:00:54] mobrovac: I can or you can, your call. [23:01:28] kk i'll do it, no worries James_F [23:01:37] Urbanecm: ready for your patch to go? 
[23:01:42] sure [23:02:02] (03PS3) 10Mobrovac: Add wikimania years namespaces to wgNamespacesWithSubpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [23:04:20] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] Add wikimania years namespaces to wgNamespacesWithSubpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [23:04:55] (03PS1) 10RobH: decom phab1002 production dns and use for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/505006 (https://phabricator.wikimedia.org/T221391) [23:06:38] (03CR) 10RobH: [C: 03+2] decom phab1002 production dns and use for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/505006 (https://phabricator.wikimedia.org/T221391) (owner: 10RobH) [23:06:59] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add wikimania years namespaces to wgNamespacesWithSubpages - T220950 (duration: 00m 53s) [23:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:04] T220950: Enable subpages for namespace 2019 and other years namespaces - https://phabricator.wikimedia.org/T220950 [23:07:15] Urbanecm: {{done}}, please check [23:07:20] thx [23:07:50] works [23:07:52] thanks again [23:08:06] gr8 [23:10:58] !log mobrovac@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/EventBus/includes: (no justification provided) (duration: 00m 54s) [23:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:23] ok all patches done [23:11:31] anyone else needs to swat something before we close it?
[23:12:13] (03CR) 10jenkins-bot: Add wikimania years namespaces to wgNamespacesWithSubpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504939 (https://phabricator.wikimedia.org/T220950) (owner: 10Urbanecm) [23:13:20] mobrovac: looking good mobrovac thank you! [23:13:44] so far so good ottomata, yeah, let's monitor for a little while before declaring victory [23:16:37] ok calling swat done [23:16:48] !log evening SWAT completed [23:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:20] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [23:39:26] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [23:44:14] (03PS1) 10BryanDavis: wmcs: Respect LDAP locks in ssh-key-ldap-lookup [puppet] - 10https://gerrit.wikimedia.org/r/505025 (https://phabricator.wikimedia.org/T168692) [23:45:00] (03CR) 10jenkins-bot: [V: 04-1] wmcs: Respect LDAP locks in ssh-key-ldap-lookup [puppet] - 10https://gerrit.wikimedia.org/r/505025 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis) [23:53:42] (03PS2) 10BryanDavis: wmcs: Respect LDAP locks in ssh-key-ldap-lookup [puppet] - 10https://gerrit.wikimedia.org/r/505025 (https://phabricator.wikimedia.org/T168692) [23:55:28] 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn) [23:57:01] (03CR) 10BryanDavis: wmcs: Respect LDAP locks in ssh-key-ldap-lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505025 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis) [23:57:29] (03PS1) 10Dzahn: netboot: add
phab1003 to partman [puppet] - 10https://gerrit.wikimedia.org/r/505040 (https://phabricator.wikimedia.org/T221389)