[00:05:18] Operations, Parsoid-PHP, serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (Dzahn) The memory_limit set in php-fpm config is 500M. When trying to curl from restbase1017 to wtp1025 the error shows a limit of 660M though. ` [restbase1017:~] $ curl -H "Host: en.w...
[00:06:01] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:31] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:14] Operations, Wikimedia-General-or-Unknown, SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (Aklapper)
[00:25:03] Operations, Wikimedia-General-or-Unknown, SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (Aklapper)
[00:27:02] Operations, ops-eqiad, DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @10am UTC) - https://phabricator.wikimedia.org/T227542 (wiki_willy)
[00:30:45] Operations, Wikimedia-General-or-Unknown, SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (Masumrezarock100) @Krinkle Is Reading web team associated with this task? Just wondering why you added #Readers-Web-Backlog tag in the duplicate.
[00:47:45] Operations, Parsoid-PHP, serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (Dzahn) Ok, wrong curl command to actually talk to wtp1025 and not the cluster while avoiding cert issue. This works: ` curl --header "Host: en.wikipedia.org" --resolve 'parsoid-php.dis...
[00:54:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet,service=parsoid-php
[00:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:14] (PS1) Ammarpad: Add localized Minerva wordmark for Sindhi Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/547061 (https://phabricator.wikimedia.org/T200870)
[02:17:58] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Jclark-ctr) updated netbox, Finished Idrac and bios setup
[02:19:07] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Jclark-ctr)
[02:19:48] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Jclark-ctr) Host Switchport elastic1053 33 elastic1054 30 elastic1055 26 elastic1056 23 elastic1057...
[02:40:04] !log restarting ats-tls on cp3050 with half open disabled - T236458
[02:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:11] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458
[03:04:17] Operations, Parsoid-PHP, serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (ssastry) Let us not bump the memory limit quite yet. I am curious to see how many instances of these we run into and we can then test these pages on scandium and determine what might be a...
[03:05:37] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:06] !log Rolling restart of prometheus-exporter-trafficserver-tls - T236458
[03:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:12] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458
[03:10:25] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:52] (CR) Vgutierrez: [C: +2] ATS: Enforce POST size limit of 100mb on ats-tls [puppet] - https://gerrit.wikimedia.org/r/546790 (https://phabricator.wikimedia.org/T236755) (owner: Vgutierrez)
[03:30:40] (PS1) Vgutierrez: ATS: Disable HTTP/2 max error rate threshold [puppet] - https://gerrit.wikimedia.org/r/547067
[03:36:35] (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/19131/" [puppet] - https://gerrit.wikimedia.org/r/547067 (owner: Vgutierrez)
[03:39:34] Operations, Parsoid-PHP, serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (ssastry) >>! In T236833#5617979, @ssastry wrote: > Let us not bump the memory limit quite yet. I am curious to see how many instances of these we run into and we can then test these pages...
[04:12:13] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:22:49] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:24:52] !log restarting ats-tls on cp4027 with half open disabled - T236458
[04:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:59] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458
[04:43:40] (PS2) EBernhardson: airflow: Add upstream configuration [puppet] - https://gerrit.wikimedia.org/r/544996
[04:43:42] (PS7) EBernhardson: airflow: Initial deployment for search platform [puppet] - https://gerrit.wikimedia.org/r/544989
[04:43:44] (PS8) EBernhardson: airflow: Run webserver and scheduler processes [puppet] - https://gerrit.wikimedia.org/r/544990
[05:06:36] Operations, Wikimedia-Incident: Slow loading and connectivity issues on enwikipedia - https://phabricator.wikimedia.org/T236872 (RhinosF1)
[05:08:01] Operations, Wikimedia-Incident: Slow loading and connectivity issues on some wikis - https://phabricator.wikimedia.org/T236872 (RhinosF1) p:Triage→High
[05:08:16] Operations, Wikimedia-Incident: Slow loading and connectivity issues on some wikis - https://phabricator.wikimedia.org/T236872 (RhinosF1) Raising to high as affecting multiple wikis
[05:27:27] Operations, Wikimedia-Incident: Slow loading and connectivity issues on some wikis - https://phabricator.wikimedia.org/T236872 (RhinosF1) p:High→Normal Seems to have recovered but leaving open for ops investigations
[05:44:21] Operations, Wikimedia-Incident: Slow loading 
and connectivity issues on some wikis - https://phabricator.wikimedia.org/T236872 (jijiki) Open→Resolved a:jijiki I couldn't reproduce this slowness (from Europe) right after the task was opened, I am marking it as resolved for now, please reopen if you...
[05:44:30] (PS1) Vgutierrez: ATS: Set the default activity timeout to 300 seconds on ats-tls [puppet] - https://gerrit.wikimedia.org/r/547073 (https://phabricator.wikimedia.org/T236458)
[05:47:50] Operations, Wikimedia-Incident: Slow loading and connectivity issues on some wikis - https://phabricator.wikimedia.org/T236872 (RhinosF1) No problem, Thanks!
[05:55:43] (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/19132/" [puppet] - https://gerrit.wikimedia.org/r/547073 (https://phabricator.wikimedia.org/T236458) (owner: Vgutierrez)
[05:58:10] !log Rolling restart of ats-tls to get rid of leaked sockets and benefit from the lower inactivity timeout - T236458
[05:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:16] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458
[06:06:03] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:10:57] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:25] (CR) Elukey: [C: +1] "It looks good to me, but the ownership of the Kafka main cluster is now of the Infra Foundations team, so Keith/Cole would probably be the" [puppet] - https://gerrit.wikimedia.org/r/545094 (owner: Dzahn)
[06:30:35] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 90.21% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[06:40:10] (CR) Giuseppe Lavagetto: [C: +1] "> It would be good to have this reviewed, yes. I'm not sure how to" [puppet] - https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[07:36:12] (PS1) Effie Mouzeli: hhvm: fremove mwrepl and /etc/hhvm/fatal-error.php [puppet] - https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792)
[07:37:52] (PS2) Effie Mouzeli: hhvm: remove mwrepl and /etc/hhvm/fatal-error.php [puppet] - https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792)
[07:38:00] (PS3) Effie Mouzeli: hhvm: remove mwrepl and /etc/hhvm/fatal-error.php [puppet] - https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792)
[07:57:30] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[07:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:18] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[07:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:30] (CR) Elukey: "From the patch I can see that Airflow's code is deployed alongside with Search-specific one, same thing for the puppet specific parts. 
Wou" [puppet] - https://gerrit.wikimedia.org/r/544989 (owner: EBernhardson)
[08:19:57] Operations, observability, serviceops, Patch-For-Review, Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (Joe) In the case of HHVM, fatals were correctly handled by the daemon and...
[08:20:09] PROBLEM - Check Varnish expiry mailbox lag on cp5008 is CRITICAL: CRITICAL: expiry mailbox lag is 8919499 https://wikitech.wikimedia.org/wiki/Varnish
[08:20:09] (PS1) KartikMistry: Enable CX out of beta in Albanian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/547086 (https://phabricator.wikimedia.org/T236064)
[08:21:43] Operations, netops, Wikimedia-Incident: Improve resiliency of the eqsin transport link - https://phabricator.wikimedia.org/T236878 (ayounsi) p:Triage→Normal
[08:25:46] !log installing php7.0 security updates
[08:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:25] (CR) Volans: "Couple of comments and an open question inline" (5 comments) [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[08:31:34] Operations, netops: cr3-esams crash - https://phabricator.wikimedia.org/T236598 (ayounsi) re1 is unresponsive, even through console. We have 2 options to try to power cycle it: - Have someone onsite unseat/reseat the card (non disruptive) - Power cycle the whole router (disruptive) I'd suggest the 2nd...
[08:40:37] (PS1) Muehlenhoff: Install php-readline on deployment servers instead of 7.0 version [puppet] - https://gerrit.wikimedia.org/r/547143
[08:45:57] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade
[08:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:44] (CR) Ayounsi: New esams stuff (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/545660 (https://phabricator.wikimedia.org/T235805) (owner: Ayounsi)
[08:51:41] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload
[08:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:09] onimisionipe: ^
[08:53:17] yep
[08:55:26] (PS1) Effie Mouzeli: (WIP) prometheus: remove hhvm stats gathering [puppet] - https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792)
[08:57:31] (CR) jerkins-bot: [V: -1] (WIP) prometheus: remove hhvm stats gathering [puppet] - https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[09:01:37] (PS1) Effie Mouzeli: prometheus::cluster_config: add $ensure in definition [puppet] - https://gerrit.wikimedia.org/r/547145
[09:21:42] Operations, Analytics, Analytics-Kanban, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Vgutierrez) @JAllemandou it's currently split like this: ` - VCL_Log CP-TLS-Version: TLSv1.2 - VCL_Log CP-TLS-Sess...
[09:23:29] (CR) Elukey: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: Joal)
[09:24:07] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.37 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:25:10] spike 30 minutes ago...
[09:25:45] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[09:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:33] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 89.65 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:30:55] (CR) Giuseppe Lavagetto: [C: -1] "The patch now works as intended, but I would like to see it improved further. See comments in the code." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: Effie Mouzeli)
[09:31:55] (CR) Filippo Giunchedi: "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: Effie Mouzeli)
[09:33:33] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp4028 [puppet] - https://gerrit.wikimedia.org/r/547149 (https://phabricator.wikimedia.org/T231627)
[09:33:35] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp4028 [puppet] - https://gerrit.wikimedia.org/r/547150 (https://phabricator.wikimedia.org/T231627)
[09:33:38] Operations, observability, serviceops, Patch-For-Review, Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (fgiunchedi) >>! In T234283#5618330, @Joe wrote: > We need to do one of th...
[09:34:28] !log Switch from nginx to ats-tls on cp4028 - T231627
[09:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:33] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[09:36:09] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp4028 [puppet] - https://gerrit.wikimedia.org/r/547149 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:37:27] (PS1) Muehlenhoff: Rename attrbute to enable MFA [puppet] - https://gerrit.wikimedia.org/r/547153
[09:38:01] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp4028 [puppet] - https://gerrit.wikimedia.org/r/547150 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:38:20] Operations, Analytics, Analytics-Kanban, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (JAllemandou) Thanks @Vgutierrez - I think representing those values in a map (or an array) is probably the easiest and most flexibl...
[09:38:23] (CR) Filippo Giunchedi: [C: +2] swift: default to multiple object servers per port [puppet] - https://gerrit.wikimedia.org/r/546956 (https://phabricator.wikimedia.org/T222366) (owner: Filippo Giunchedi)
[09:40:28] (CR) Jbond: "looks good one minor style nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547153 (owner: Muehlenhoff)
[09:44:04] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[09:45:23] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:46:25] (CR) Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: Joal)
[09:46:53] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp4029 [puppet] - https://gerrit.wikimedia.org/r/547156 (https://phabricator.wikimedia.org/T231627)
[09:46:55] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp4029 [puppet] - https://gerrit.wikimedia.org/r/547157 (https://phabricator.wikimedia.org/T231627)
[09:46:57] !log Switch from nginx to ats-tls on cp4029 - T231627
[09:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:02] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[09:47:06] (PS8) Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333)
[09:47:44] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp4029 [puppet] - https://gerrit.wikimedia.org/r/547156 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:48:17] Operations, User-fgiunchedi: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (fgiunchedi)
[09:48:30] Operations, User-fgiunchedi: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (fgiunchedi) Open→Resolved Hosts are fully in service, resolving
[09:49:32] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp4029 [puppet] - https://gerrit.wikimedia.org/r/547157 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:55:30] Operations, Traffic, 
Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[10:07:03] !log restarting bacula-dir, bacula-sd on backup1001 T236406
[10:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:11] T236406: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406
[10:12:40] (CR) Muehlenhoff: Rename attrbute to enable MFA (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547153 (owner: Muehlenhoff)
[10:12:53] (PS2) Muehlenhoff: Rename attribute to enable MFA [puppet] - https://gerrit.wikimedia.org/r/547153
[10:17:27] (PS1) Giuseppe Lavagetto: blubberoid: add TLS termination in staging [deployment-charts] - https://gerrit.wikimedia.org/r/547159
[10:20:05] (CR) Filippo Giunchedi: [C: +1] prometheus: Add new prometheus mysqld exporter group: test [puppet] - https://gerrit.wikimedia.org/r/547009 (owner: Jcrespo)
[10:21:54] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/547153 (owner: Muehlenhoff)
[10:22:27] (CR) Jcrespo: [C: +2] prometheus: Add new prometheus mysqld exporter group: test [puppet] - https://gerrit.wikimedia.org/r/547009 (owner: Jcrespo)
[10:22:31] (PS9) Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333)
[10:25:49] (CR) Filippo Giunchedi: "Code structure LGTM! 
PCC fails on icinga1001: https://puppet-compiler.wmflabs.org/compiler1001/19136/icinga1001.wikimedia.org/change.icing" [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[10:28:23] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/547030 (https://phabricator.wikimedia.org/T231870) (owner: CDanis)
[10:35:53] (CR) Filippo Giunchedi: Introduce Elastic 7 support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: Filippo Giunchedi)
[10:36:18] Critical Alert for device cr3-esams.wikimedia.org - Juniper alarm active
[10:38:18] (PS3) Filippo Giunchedi: Introduce Elastic 7 support [puppet] - https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854)
[10:41:17] (PS4) Effie Mouzeli: logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283)
[10:41:54] (PS2) Jbond: puppet_compile: use full puppet version in html report [software/puppet-compiler] - https://gerrit.wikimedia.org/r/546930
[10:42:40] (CR) Effie Mouzeli: logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal (2 comments) [puppet] - https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: Effie Mouzeli)
[10:43:05] Operations, Analytics, Analytics-Kanban, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (elukey) Sure! The JSON format of what we collect from Varnish for webrequest is in `profile::cache::kafka::webrequest`: ` format...
[10:43:56] (CR) Jbond: [C: +2] hiera_lookup: add message pointing to `puppet lookup` [puppet] - https://gerrit.wikimedia.org/r/546143 (owner: Jbond)
[10:44:03] (CR) Faidon Liambotis: New esams stuff (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/545660 (https://phabricator.wikimedia.org/T235805) (owner: Ayounsi)
[10:44:11] (CR) Effie Mouzeli: "LGTM https://puppet-compiler.wmflabs.org/compiler1002/19135/" [puppet] - https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[10:46:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:48:06] (CR) Jbond: [C: +2] wmflib::secret: add a new secret function which supports binary files [puppet] - https://gerrit.wikimedia.org/r/546464 (https://phabricator.wikimedia.org/T236481) (owner: Jbond)
[10:48:20] (PS1) Filippo Giunchedi: aptrepo: include minor version in elastic 7 repos [puppet] - https://gerrit.wikimedia.org/r/547161 (https://phabricator.wikimedia.org/T234854)
[10:51:15] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546465 (https://phabricator.wikimedia.org/T236481) (owner: Jbond)
[10:55:23] (CR) Jbond: [C: +2] "thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546465 (https://phabricator.wikimedia.org/T236481) (owner: Jbond)
[10:55:37] (PS2) Jbond: apereo_cas: migrate keystor to wmflib::secret [puppet] - https://gerrit.wikimedia.org/r/546465 (https://phabricator.wikimedia.org/T236481)
[10:55:59] (CR) jerkins-bot: [V: -1] apereo_cas: migrate keystor to wmflib::secret [puppet] - https://gerrit.wikimedia.org/r/546465 (https://phabricator.wikimedia.org/T236481) (owner: Jbond)
[10:56:33] 
(PS3) Jbond: apereo_cas: migrate keystor to wmflib::secret [puppet] - https://gerrit.wikimedia.org/r/546465 (https://phabricator.wikimedia.org/T236481)
[10:59:43] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T1100).
[11:00:05] duesen and matthiasmullie: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:05] (CR) Jbond: [C: +2] apereo_cas: migrate keystor to wmflib::secret [puppet] - https://gerrit.wikimedia.org/r/546465 (https://phabricator.wikimedia.org/T236481) (owner: Jbond)
[11:00:53] * duesen wibbles
[11:01:22] (CR) Faidon Liambotis: [C: -1] New esams stuff (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/545660 (https://phabricator.wikimedia.org/T235805) (owner: Ayounsi)
[11:01:30] duesen: Do you want to deploy your patch yourself, or do you prefer me doing it for you?
[11:01:58] Urbanecm: i don't have +2 on config. 
also, I'm scared :)
[11:02:30] duesen: okay then :
[11:02:55] (CR) Urbanecm: [C: +2] Re-apply: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/546875 (https://phabricator.wikimedia.org/T198558) (owner: Daniel Kinzler)
[11:03:50] (Merged) jenkins-bot: Re-apply: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/546875 (https://phabricator.wikimedia.org/T198558) (owner: Daniel Kinzler)
[11:04:00] Operations, Analytics, Analytics-Kanban, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (elukey) I set up a test webrequest.conf on cp2001, and confirmed that the solution works! Side note - varnishkafka set with output...
[11:04:19] duesen: please test your patch at mwdebug1001, and let me know.
[11:04:20] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: Effie Mouzeli)
[11:05:02] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/547145 (owner: Effie Mouzeli)
[11:05:50] (PS1) Jbond: Revert "apereo_cas: migrate keystor to wmflib::secret" [puppet] - https://gerrit.wikimedia.org/r/547166
[11:06:28] Urbanecm: working on it
[11:06:36] ack
[11:08:10] (CR) Jbond: [C: +2] Revert "apereo_cas: migrate keystor to wmflib::secret" [puppet] - https://gerrit.wikimedia.org/r/547166 (owner: Jbond)
[11:08:12] (CR) Effie Mouzeli: [C: +1] mediawiki: Add 'caught_by' to php7-fatal-error log message [puppet] - https://gerrit.wikimedia.org/r/546230 (https://phabricator.wikimedia.org/T234283) (owner: Krinkle)
[11:08:44] Urbanecm: looks like it's working!
[11:08:58] wonderful! 
Going to sync then :)
[11:11:36] (CR) Filippo Giunchedi: "> Patch Set 1:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) (owner: Jbond)
[11:11:51] duesen: synced, seems logmsgbot had an outage
[11:11:57] (CR) Filippo Giunchedi: [C: -1] "See inline, blocking is having a variable for the hours threshold" [puppet] - https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) (owner: Jbond)
[11:12:16] Urbanecm: as long as it's just logmsgbot...
[11:12:42] !log Synchronized wmf-config/InitialiseSettings.php: SWAT: 61cb77c: Re-apply: MCR: Set testwiki to use the new MCR-only schema (T198558) (duration: 00m 59s)
[11:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:48] T198558: Set testwiki to use the new MCR-only schema - https://phabricator.wikimedia.org/T198558
[11:12:50] yeah :)
[11:13:34] !log EU SWAT done
[11:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:34] Urbanecm: actually, can you give me admin rights on testwiki? 
That would allow me to test more use cases
[11:15:37] (CR) Effie Mouzeli: [C: +1] Install php-readline on deployment servers instead of 7.0 version [puppet] - https://gerrit.wikimedia.org/r/547143 (owner: Muehlenhoff)
[11:16:18] (CR) Effie Mouzeli: [C: +2] Revert "lvs::monitor_services: increase number of tries before MCS is critical" [puppet] - https://gerrit.wikimedia.org/r/545873 (https://phabricator.wikimedia.org/T229286) (owner: Mholloway)
[11:16:29] duesen: I'm not a bureaucrat there
[11:17:21] But you can run maintenance/createAndPromote.php --sysop --force "DKinzler_(WMF)" ;)
[11:17:49] nvm, Kosta tested deletion/undeletion for me
[11:17:56] (CR) Giuseppe Lavagetto: [C: +2] scaffold: only expose one port as a service by default [deployment-charts] - https://gerrit.wikimedia.org/r/544629 (owner: Giuseppe Lavagetto)
[11:17:58] I'll find a bureaucrat
[11:17:59] (CR) Alexandros Kosiaris: [C: +1] blubberoid: add TLS termination in staging [deployment-charts] - https://gerrit.wikimedia.org/r/547159 (owner: Giuseppe Lavagetto)
[11:18:01] sorry I'm late for SWAT
[11:18:03] that was fast duesen :-)
[11:18:06] timezone switchover got me messed up :p
[11:18:12] (Merged) jenkins-bot: scaffold: only expose one port as a service by default [deployment-charts] - https://gerrit.wikimedia.org/r/544629 (owner: Giuseppe Lavagetto)
[11:18:17] matthiasmullie: I've already closed it, feel free to reopen & do your stuff :-)
[11:18:21] Urbanecm: I'm sitting in his kitchen, that helps :)
[11:18:25] We're sitting at the same table so it was easy
[11:18:27] ha
[11:18:38] Urbanecm: so, SWAT done right now? 
[11:18:45] ok, I'll go do my thing then
[11:18:52] (CR) Filippo Giunchedi: "LGTM overall, see inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite)
[11:18:53] matthiasmullie: yup, I'm done :)
[11:19:00] duesen: in case you don't know, https://test.wikipedia.org/wiki/Wikipedia:Requests :-)
[11:19:12] ok cool - going in
[11:19:43] (CR) Giuseppe Lavagetto: [C: +2] blubberoid: add TLS termination in staging [deployment-charts] - https://gerrit.wikimedia.org/r/547159 (owner: Giuseppe Lavagetto)
[11:19:46] (PS3) Matthias Mullie: Increase rate limits for newbie non-ip users [mediawiki-config] - https://gerrit.wikimedia.org/r/541532 (https://phabricator.wikimedia.org/T231463)
[11:19:58] (Merged) jenkins-bot: blubberoid: add TLS termination in staging [deployment-charts] - https://gerrit.wikimedia.org/r/547159 (owner: Giuseppe Lavagetto)
[11:23:21] (CR) Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1001/19142/prometheus1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/547145 (owner: Effie Mouzeli)
[11:23:49] (PS1) Giuseppe Lavagetto: blubberoid/staging: brown paper bag fix [deployment-charts] - https://gerrit.wikimedia.org/r/547170
[11:25:01] (CR) Giuseppe Lavagetto: [C: +2] blubberoid/staging: brown paper bag fix [deployment-charts] - https://gerrit.wikimedia.org/r/547170 (owner: Giuseppe Lavagetto)
[11:25:16] (Merged) jenkins-bot: blubberoid/staging: brown paper bag fix [deployment-charts] - https://gerrit.wikimedia.org/r/547170 (owner: Giuseppe Lavagetto)
[11:26:22] (CR) Effie Mouzeli: [C: +2] prometheus::cluster_config: add $ensure in definition [puppet] - https://gerrit.wikimedia.org/r/547145 (owner: Effie Mouzeli)
[11:28:45] (PS2) Effie Mouzeli: (WIP) prometheus: remove hhvm stats gathering [puppet] - https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792)
[11:29:01] (03CR) 10Matthias Mullie: [C: 03+2] Increase rate limits for newbie non-ip users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541532 (https://phabricator.wikimedia.org/T231463) (owner: 10Matthias Mullie) [11:29:48] (03Merged) 10jenkins-bot: Increase rate limits for newbie non-ip users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541532 (https://phabricator.wikimedia.org/T231463) (owner: 10Matthias Mullie) [11:31:00] (03CR) 10jerkins-bot: [V: 04-1] (WIP) prometheus: remove hhvm stats gathering [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:31:50] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Increase rate limits for newbie non-ip users on Commons (duration: 01m 01s) [11:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:24] (03PS1) 10Ema: cache: reimage cp5008 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/547178 (https://phabricator.wikimedia.org/T227432) [11:37:57] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [11:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] profile::acme_chief::cloud: Require python3-designateclient etc. 
[puppet] - 10https://gerrit.wikimedia.org/r/545081 (https://phabricator.wikimedia.org/T235252) (owner: 10Alex Monk) [11:42:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::mediawiki::httpd: set a SERVERGROUP env variable [puppet] - 10https://gerrit.wikimedia.org/r/546448 (https://phabricator.wikimedia.org/T235899) (owner: 10Giuseppe Lavagetto) [11:42:17] !log depool cp5008 and reimage as text_ats T227432 [11:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:22] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [11:42:32] (03CR) 10Ema: [C: 03+2] cache: reimage cp5008 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/547178 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:43:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: ingress: use docker image from internal registry [puppet] - 10https://gerrit.wikimedia.org/r/546459 (https://phabricator.wikimedia.org/T236249) (owner: 10Arturo Borrero Gonzalez) [11:44:08] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5008.eqsin.wmnet'] ` The log can be found in `/var/log/wm... [11:45:53] (03PS1) 10Giuseppe Lavagetto: blubberoid: various fixes to the tls sections of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 [11:47:14] (03PS1) 10Jbond: apereo_cas: migrate keystor to wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/547180 [11:49:16] !log temporarily disabling puppet on LDAP servers for a schema change [11:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "the comment statement is inaccurate." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [11:50:39] (03CR) 10Jbond: [C: 03+2] apereo_cas: migrate keystor to wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/547180 (owner: 10Jbond) [11:53:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This look a pretty easy and probably common mistake to make. We need to figure out how to catch it in CI" [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (owner: 10Giuseppe Lavagetto) [11:57:16] (03PS3) 10Muehlenhoff: Rename attribute to enable MFA [puppet] - 10https://gerrit.wikimedia.org/r/547153 [11:59:52] (03PS1) 10Jbond: Revert "apereo_cas: migrate keystor to wmflib::secret" [puppet] - 10https://gerrit.wikimedia.org/r/547181 [12:00:06] (03CR) 10Muehlenhoff: [C: 03+2] Rename attribute to enable MFA [puppet] - 10https://gerrit.wikimedia.org/r/547153 (owner: 10Muehlenhoff) [12:01:14] (03CR) 10Jbond: [C: 03+2] puppet_compile: use full puppet version in html report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546930 (owner: 10Jbond) [12:02:06] (03CR) 10Jbond: [C: 03+2] Revert "apereo_cas: migrate keystor to wmflib::secret" [puppet] - 10https://gerrit.wikimedia.org/r/547181 (owner: 10Jbond) [12:03:02] (03PS1) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:06:11] (03PS5) 10Effie Mouzeli: logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) [12:07:04] (03PS2) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:09:41] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:10:34] (03PS3) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:16:36] (03PS4) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:19:34] (03PS5) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:21:51] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops: Allow testing of feature-flag-protected features in deployment-charts CI - https://phabricator.wikimedia.org/T236899 (10Joe) [12:22:02] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops: Allow testing of feature-flag-protected features in deployment-charts CI - https://phabricator.wikimedia.org/T236899 (10Joe) p:05Triage→03High a:03Joe [12:22:11] (03PS6) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:22:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:22:57] 10Operations, 10Traffic: Slow loading and connectivity issues on some wikis - https://phabricator.wikimedia.org/T236872 (10Aklapper) 05Resolved→03Declined [12:22:57] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [12:22:57] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:25] (03CR) 10Alexandros Kosiaris: ":D" [puppet] - 
10https://gerrit.wikimedia.org/r/546164 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:24:26] (03PS3) 10Effie Mouzeli: prometheus: remove hhvm stats gathering and stop exporters [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) [12:26:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:29:22] (03CR) 10Alexandros Kosiaris: "Minor inline comment, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:30:17] (03PS1) 10Muehlenhoff: Extend Cumin aliases for LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/547185 [12:31:17] (03PS7) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:31:31] (03CR) 10Effie Mouzeli: hhvm: remove mwrepl and /etc/hhvm/fatal-error.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:31:38] 10Operations, 10SRE-tools: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684 (10ema) I've just observed the issue again with cp5008: ` 12:21:02 | cp5008.eqsin.wmnet | Started first puppet run (sit back, relax, and enjoy the wait) START - Cookbook sre.hosts.downtime... [12:35:30] (03CR) 10Muehlenhoff: [C: 03+2] Extend Cumin aliases for LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/547185 (owner: 10Muehlenhoff) [12:37:37] RECOVERY - Check the Netbox report puppetdb for fail status. 
on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:37:58] (03PS8) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 [12:38:18] (03PS1) 10Gehel: elasticsearch: run apt with DEBIAN_FRONTEND=noninteractive [cookbooks] - 10https://gerrit.wikimedia.org/r/547187 [12:39:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/547187 (owner: 10Gehel) [12:39:50] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: kubeadm-k8s: introduce filter for some upstream versions [puppet] - 10https://gerrit.wikimedia.org/r/547188 (https://phabricator.wikimedia.org/T236824) [12:39:56] moritzm: you have an alert on anything APT related? [12:40:02] moritzm: thanks! that was fast! [12:40:30] (03CR) 10Gehel: [C: 03+2] elasticsearch: run apt with DEBIAN_FRONTEND=noninteractive [cookbooks] - 10https://gerrit.wikimedia.org/r/547187 (owner: 10Gehel) [12:40:42] I just saw it fly by :) [12:44:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: kubeadm-k8s: introduce filter for some upstream versions [puppet] - 10https://gerrit.wikimedia.org/r/547188 (https://phabricator.wikimedia.org/T236824) (owner: 10Arturo Borrero Gonzalez) [12:46:56] (03CR) 10CDanis: [C: 03+2] grafana: double-proxy for wpt-graphite [puppet] - 10https://gerrit.wikimedia.org/r/547030 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis) [12:49:09] PROBLEM - Host cp5008 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:22] !log updating package versions in install1002 for thirdparty/kubeadm-k8s stretch-wikimedia (T236824) [12:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:28] T236824: Toolforge: new k8s: get new deb packages for 1.15.4 or 1.15.5 - https://phabricator.wikimedia.org/T236824 [12:52:52] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - 
https://phabricator.wikimedia.org/T230746 (10Jclark-ctr) a:05Jclark-ctr→03Christopher [12:57:15] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: name=cp5008.eqsin.wmnet [12:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:02] !log rolling restart of slapd to pick up LDAP schema change [12:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:13] 10Operations, 10Puppet: Investigate using the rich_)data opsion to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [12:59:16] (03Abandoned) 10Jbond: apereo_cas: migrate keystore to wmflib::secret"" [puppet] - 10https://gerrit.wikimedia.org/r/547182 (owner: 10Jbond) [12:59:22] !log cdanis@cumin1001 conftool action : set/pooled=inactive; selector: name=cp5008.eqsin.wmnet [12:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:04] 10Operations, 10Puppet: Investigate using the rich_data opsion to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [13:01:41] 10Operations, 10Puppet: Investigate using the rich_data opsion to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [13:03:42] (03PS4) 10Effie Mouzeli: hhvm: remove mwrepl and /etc/hhvm/fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792) [13:07:19] (03Abandoned) 10Effie Mouzeli: WIP: jobrunners: Make jobrunners PHP7 only by default [puppet] - 10https://gerrit.wikimedia.org/r/526132 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli) [13:11:02] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/19151/" [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:13:07] (03PS2) 10Jbond: check_puppetrun: don't alert for disabled puppet 
agents for 1 day [puppet] - 10https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) [13:13:40] (03CR) 10Jbond: "thanks, updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) (owner: 10Jbond) [13:16:43] (03CR) 10Filippo Giunchedi: [C: 03+1] check_puppetrun: don't alert for disabled puppet agents for 1 day [puppet] - 10https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) (owner: 10Jbond) [13:17:32] (03CR) 10Jbond: [C: 03+2] check_puppetrun: don't alert for disabled puppet agents for 1 day [puppet] - 10https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) (owner: 10Jbond) [13:19:00] (03PS4) 10Effie Mouzeli: prometheus: remove hhvm stats gathering and stop exporters [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) [13:19:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Do we need to rebuild the cluster with this new init config or can we enable it at runtime?" 
[puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) (owner: 10Bstorm) [13:20:00] (03CR) 10Hashar: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/546989 (owner: 10Hashar) [13:23:06] jbond42: mhh a bunch of 'unable to read output' from nrpe for puppet last run [13:23:51] ack, I'll revert [13:24:13] (03PS1) 10Jbond: Revert "check_puppetrun: don't alert for disabled puppet agents for 1 day" [puppet] - 10https://gerrit.wikimedia.org/r/547195 [13:24:36] heh, haven't checked what's wrong though [13:25:38] yes, I see; I think I have one of the variables named wrong. I'll send a follow-up patch later [13:26:14] (03PS1) 10Jbond: puppetdb6: migrate cumin1001 to puppetmast using new db [puppet] - 10https://gerrit.wikimedia.org/r/547196 (https://phabricator.wikimedia.org/T235655) [13:26:58] (03PS1) 10ArielGlenn: ability to configure a wiki to produce empty abstract files [dumps] - 10https://gerrit.wikimedia.org/r/547197 (https://phabricator.wikimedia.org/T236006) [13:27:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:29:21] (03CR) 10RLazarus: "Okay, PTAL! 
I'm still a little uncertain about how to handle dependency versions -- is there a Right Way to express "pin to what's in Debi" (0311 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [13:29:34] (03PS1) 10Jbond: puppet_checkpuppetrun: fix variable name [puppet] - 10https://gerrit.wikimedia.org/r/547199 [13:29:46] godog: can you review ^^ [13:30:27] (03CR) 10Filippo Giunchedi: [C: 03+1] puppet_checkpuppetrun: fix variable name [puppet] - 10https://gerrit.wikimedia.org/r/547199 (owner: 10Jbond) [13:30:32] jbond42: yup, LGTM [13:30:39] thx [13:31:01] (03CR) 10Krinkle: [C: 03+1] logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [13:31:30] (03CR) 10Effie Mouzeli: [C: 03+2] logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [13:31:49] (03CR) 10Jbond: [C: 03+2] puppet_checkpuppetrun: fix variable name [puppet] - 10https://gerrit.wikimedia.org/r/547199 (owner: 10Jbond) [13:32:16] (03PS8) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) [13:32:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:34:16] (03PS2) 10Muehlenhoff: Install php-readline on deployment servers instead of 7.0 version [puppet] - 10https://gerrit.wikimedia.org/r/547143 [13:34:39] looks like the mw errors are 60s timeouts btw [13:35:34] latencies are elevated quite a bit, not just those timeouts, also 75%ile [13:35:44] on zhwiki [13:35:46] mcrouter traffic is way up [13:36:05] !log andrew@deploy1001 Started deploy [horizon/deploy@53028ab]: Rolling out improvments to the puppet git archiver [13:36:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:08] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5008.eqsin.wmnet'] ` Of which those **FAILED**: ` ['cp5008.eqsin.wmnet'] ` [13:37:41] I suspect we got some interesting traffic [13:38:16] (03CR) 10Muehlenhoff: [C: 03+2] Install php-readline on deployment servers instead of 7.0 version [puppet] - 10https://gerrit.wikimedia.org/r/547143 (owner: 10Muehlenhoff) [13:38:42] indeed [13:39:42] !log andrew@deploy1001 Finished deploy [horizon/deploy@53028ab]: Rolling out improvments to the puppet git archiver (duration: 03m 38s) [13:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:44:30] it looks like timeouts deep in wikitext parsing on zhwiki [13:44:42] (and probably has more to do with changes in traffic pattern than anything else) [13:46:46] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5008.eqsin.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019103... 
[13:48:09] (03PS6) 10CDanis: Grafana: Add external Graphite for synthetic testing [puppet] - 10https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870) (owner: 10Phedenskog) [13:48:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:48:57] RECOVERY - HTTPS-wmflabs on tools.wmflabs.org is OK: SSL OK - Certificate toolforge.org valid until 2020-01-19 22:15:27 +0000 (expires in 81 days) https://phabricator.wikimedia.org/tag/toolforge/ [13:50:01] RECOVERY - Host cp5008 is UP: PING OK - Packet loss = 0%, RTA = 235.24 ms [13:50:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Swap toolforge proxies to use acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252) (owner: 10Alex Monk) [13:50:31] (03CR) 10CDanis: [C: 03+2] Grafana: Add external Graphite for synthetic testing [puppet] - 10https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870) (owner: 10Phedenskog) [13:51:32] (03PS2) 10ArielGlenn: ability to configure a wiki to produce empty abstract files [dumps] - 10https://gerrit.wikimedia.org/r/547197 (https://phabricator.wikimedia.org/T236006) [13:51:58] I am not sure what is up with the mediawiki errors [13:52:02] (03PS1) 10Jbond: puppetdb6: move rdb1010 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547208 (https://phabricator.wikimedia.org/T235655) [13:52:22] different urls and different errors in php, although they allo timeout [13:52:27] all* [13:54:00] 10Operations, 10Wikidata, 10wikidata-tech-focus: Review robots.txt for Wikidata in light of entity/Q64 being indexed - https://phabricator.wikimedia.org/T227246 (10Addshore) [13:54:09] 10Operations, 
10Wikidata, 10wikidata-tech-focus: Review robots.txt for Wikidata in light of entity/Q64 being indexed - https://phabricator.wikimedia.org/T227246 (10Addshore) p:05Triage→03Normal [13:55:33] (03PS1) 10Jbond: puppetdb6: update cumin to use the new puppetdb instance [puppet] - 10https://gerrit.wikimedia.org/r/547209 [13:57:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:00:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] hhvm: remove mwrepl and /etc/hhvm/fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:02:00] (03PS3) 10ArielGlenn: ability to configure a wiki to produce empty abstract files [dumps] - 10https://gerrit.wikimedia.org/r/547197 (https://phabricator.wikimedia.org/T236006) [14:03:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:04:00] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [14:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:51] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [14:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:54] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [14:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:33] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [14:07:01] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28424 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:07:55] thanks icinga for telling me that twice [14:07:56] appreciate it [14:08:00] (the page) [14:08:26] apergos: EVERYTHING'S STILL OKAY, DON'T WORRY [14:08:41] GREAT JUST GREAT [14:08:48] TOO MUCH SHOUTING HERE [14:09:26] (03PS1) 10Jbond: puppetdb6: move prometheus1004 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547213 (https://phabricator.wikimedia.org/T235655) [14:12:53] (03PS1) 10Jbond: puppetdb6: move debmonitor1001 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547214 (https://phabricator.wikimedia.org/T235655) [14:13:21] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) Some jobs are getting stuck (single xml backups) for some issue complaining about the sd daemon. 
Even cancel gets stuck (expe... [14:15:48] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [14:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:27] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [14:16:51] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [14:17:32] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19154/" [puppet] - 10https://gerrit.wikimedia.org/r/547041 (https://phabricator.wikimedia.org/T235494) (owner: 10Ottomata) [14:17:38] (03PS4) 10Ottomata: statistics - rename published-datasets to just published [puppet] - 10https://gerrit.wikimedia.org/r/547041 (https://phabricator.wikimedia.org/T235494) [14:17:40] (03CR) 10Ottomata: [V: 03+2 C: 03+2] statistics - rename published-datasets to just published [puppet] - 10https://gerrit.wikimedia.org/r/547041 (https://phabricator.wikimedia.org/T235494) (owner: 10Ottomata) [14:17:57] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28424 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:18:30] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28537 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:19:06] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:52] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99) [14:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] (03PS1) 10Jbond: puppetdb6: update config to use the new 
puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [14:21:00] (03PS1) 10Jbond: puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) [14:21:12] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1866 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:21:42] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1866 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:23:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547196 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:24:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547208 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:24:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547213 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:24:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547214 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:25:41] (03CR) 10Jbond: [C: 03+2] puppetdb6: migrate cumin1001 to puppetmast using new db [puppet] - 10https://gerrit.wikimedia.org/r/547196 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:30:05] (03CR) 10Jbond: [C: 03+2] puppetdb6: move rdb1010 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547208 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:30:18] (03PS1) 10Arturo Borrero Gonzalez: toolforge: docker registry: use new SSL certificate by acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/547221 [14:30:29] (03PS2) 10Jbond: puppetdb6: 
move rdb1010 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547208 (https://phabricator.wikimedia.org/T235655) [14:30:47] (03PS2) 10Jbond: puppetdb6: move prometheus1004 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547213 (https://phabricator.wikimedia.org/T235655) [14:30:55] (03PS2) 10Jbond: puppetdb6: move debmonitor1001 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547214 (https://phabricator.wikimedia.org/T235655) [14:31:06] (03PS2) 10Jbond: puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [14:31:15] (03PS2) 10Jbond: puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) [14:31:44] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:32:01] !log disable puppet on all mw* hosts [14:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:50] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) Tried again this morning, but the kernel panics happen too fast to make much progress once the agent starts actually using the NIC (I've only ever had one agent run complete successf... 
[14:33:24] gehel: ^ [14:33:43] (03PS1) 10Arturo Borrero Gonzalez: toolforge: old k8s: refresh reference to ferm handlers [puppet] - 10https://gerrit.wikimedia.org/r/547222 [14:34:31] (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: remove mwrepl and /etc/hhvm/fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/547079 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:35:27] (03CR) 10Jbond: [C: 03+2] puppetdb6: move prometheus1004 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547213 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:36:04] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5008.eqsin.wmnet'] ` and were **ALL** successful. [14:36:29] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: old k8s: refresh reference to ferm handlers [puppet] - 10https://gerrit.wikimedia.org/r/547222 (owner: 10Arturo Borrero Gonzalez) [14:36:34] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 68187 bytes in 5.911 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:36:44] 10Operations, 10SRE-tools: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684 (10Volans) a:05Volans→03jbond Confirmed it's a puppetdb slowness: ` 2019-10-30 12:22:57,863 [DEBUG puppetdb.py:256 in _execute] Queried puppetdb for '["or", ["=", "certname", "cp5008.eqs... 
[14:37:06] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28817 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:37:21] (03PS1) 10Tpt: Enables Wikisource extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547223 (https://phabricator.wikimedia.org/T236502) [14:37:26] (03CR) 10Volans: "post-merge -1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547185 (owner: 10Muehlenhoff) [14:39:36] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for Wikidebate project - https://phabricator.wikimedia.org/T236829 (10Ottomata) @Sophivorus I just created this mailing list with you as the administrator. I believe you should be emailed a password. I was not prompted to add a secondary administrat... [14:40:06] onimisionipe: cluster restart in progress, will recover in a few [14:40:41] !log pool cp5008 with ATS backend T227432 [14:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:51] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:41:53] jouncebot: next [14:41:54] In 3 hour(s) and 18 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T1800) [14:42:08] (03CR) 10Jbond: [C: 03+2] puppetdb6: move debmonitor1001 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547214 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:42:19] (03PS3) 10Jbond: puppetdb6: move debmonitor1001 to puppetmaster using new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/547214 (https://phabricator.wikimedia.org/T235655) [14:43:10] (03CR) 10Muehlenhoff: [C: 03+2] Extend Cumin aliases for LDAP servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547185 (owner: 10Muehlenhoff) [14:45:40] (03CR) 10Jforrester: [C: 03+2] Enables Wikisource extension on beta 
cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547223 (https://phabricator.wikimedia.org/T236502) (owner: 10Tpt) [14:46:34] (03Merged) 10jenkins-bot: Enables Wikisource extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547223 (https://phabricator.wikimedia.org/T236502) (owner: 10Tpt) [14:46:35] (03PS3) 10Jbond: puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [14:46:57] (03CR) 10jerkins-bot: [V: 04-1] puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:47:27] (03PS1) 10Muehlenhoff: Fix up site-specific LDAP replica aliases [puppet] - 10https://gerrit.wikimedia.org/r/547224 [14:47:33] (03PS4) 10Jbond: puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [14:48:44] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T236502 Define wmgUseWikisource as default-false (duration: 01m 22s) [14:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:49] T236502: Deploy Wikisource extension to beta cluster - https://phabricator.wikimedia.org/T236502 [14:48:54] (03PS2) 10Giuseppe Lavagetto: Rakefile: add the ability to run fixtures with special values [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (https://phabricator.wikimedia.org/T236899) [14:49:56] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:50:30] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T210174 Load Wikisource extension when wmgUseWikisource is true (duration: 01m 01s) [14:50:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:36] T210174: Deploy Wikisource extension to Wikimedia cluster - https://phabricator.wikimedia.org/T210174 [14:50:55] (03CR) 10Muehlenhoff: [C: 03+2] Fix up site-specific LDAP replica aliases [puppet] - 10https://gerrit.wikimedia.org/r/547224 (owner: 10Muehlenhoff) [14:51:59] (03PS1) 10Effie Mouzeli: hhvm: force the removal of some directories [puppet] - 10https://gerrit.wikimedia.org/r/547227 (https://phabricator.wikimedia.org/T229792) [14:52:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] Rakefile: add the ability to run fixtures with special values [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (https://phabricator.wikimedia.org/T236899) (owner: 10Giuseppe Lavagetto) [14:52:54] (03PS1) 10Jforrester: extension-list: Add Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547228 (https://phabricator.wikimedia.org/T210174) [14:53:47] (03CR) 10Jforrester: [C: 03+2] extension-list: Add Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547228 (https://phabricator.wikimedia.org/T210174) (owner: 10Jforrester) [14:54:14] (03PS1) 10Gehel: elasticsearch: apt-get install needs "-y" in non interactive mode [cookbooks] - 10https://gerrit.wikimedia.org/r/547229 [14:54:39] (03Merged) 10jenkins-bot: extension-list: Add Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547228 (https://phabricator.wikimedia.org/T210174) (owner: 10Jforrester) [14:55:06] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/547229 (owner: 10Gehel) [14:55:16] (03PS1) 10Jbond: nrpe::check_puppetrun: use correct interger operation '-' not '+' [puppet] - 10https://gerrit.wikimedia.org/r/547230 [14:55:48] (03CR) 10DCausse: [C: 03+1] elasticsearch: apt-get install needs "-y" in non interactive mode [cookbooks] - 10https://gerrit.wikimedia.org/r/547229 (owner: 10Gehel) [14:56:13] (03PS1) 10Ottomata: Disable eventlogging-consumer mysql in prod [puppet] - 
10https://gerrit.wikimedia.org/r/547231 (https://phabricator.wikimedia.org/T232349) [14:56:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] Rakefile: add the ability to run fixtures with special values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (https://phabricator.wikimedia.org/T236899) (owner: 10Giuseppe Lavagetto) [14:56:23] (03PS2) 10Ottomata: Disable eventlogging-consumer mysql in prod [puppet] - 10https://gerrit.wikimedia.org/r/547231 (https://phabricator.wikimedia.org/T232349) [14:57:55] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1001/19155/mw1317.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/547227 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:58:38] (03CR) 10Jbond: [C: 03+2] nrpe::check_puppetrun: use correct interger operation '-' not '+' [puppet] - 10https://gerrit.wikimedia.org/r/547230 (owner: 10Jbond) [14:58:46] (03CR) 10CDanis: [C: 03+1] hhvm: force the removal of some directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547227 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:59:07] (03CR) 10Gehel: [C: 03+2] elasticsearch: apt-get install needs "-y" in non interactive mode [cookbooks] - 10https://gerrit.wikimedia.org/r/547229 (owner: 10Gehel) [14:59:55] (03CR) 10Ottomata: [C: 03+2] Disable eventlogging-consumer mysql in prod [puppet] - 10https://gerrit.wikimedia.org/r/547231 (https://phabricator.wikimedia.org/T232349) (owner: 10Ottomata) [15:00:19] (03CR) 10Effie Mouzeli: hhvm: force the removal of some directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547227 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [15:00:52] (03PS2) 10Effie Mouzeli: hhvm: force the removal of some directories [puppet] - 10https://gerrit.wikimedia.org/r/547227 (https://phabricator.wikimedia.org/T229792) [15:00:59] rlazarus: hi, I'd like to inquire re. an Apache patch. 
mutante said I should ping you [15:01:25] (03Merged) 10jenkins-bot: elasticsearch: apt-get install needs "-y" in non interactive mode [cookbooks] - 10https://gerrit.wikimedia.org/r/547229 (owner: 10Gehel) [15:01:46] (03PS5) 10Jbond: puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [15:02:38] (03PS3) 10Jbond: puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) [15:02:54] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [15:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:19] hauskater: hi :) can I get back to you in ~1h? [15:03:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [15:03:54] rlazarus: sure. I think I'll still be around. [15:04:00] if not, well, another time :) [15:04:52] (03PS6) 10Jbond: puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [15:05:23] (03PS4) 10Jbond: puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) [15:05:27] (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: force the removal of some directories [puppet] - 10https://gerrit.wikimedia.org/r/547227 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [15:07:32] (03PS5) 10Jbond: puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) [15:09:04] (03PS7) 10Jbond: puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [15:09:24] 10Operations, 10Readers-Web-Backlog, 
10Wikimedia-General-or-Unknown, 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Krinkle) Yes. In the past, the team most active as stakeholder around the topic of Wikipedia being indexed by search engin... [15:10:31] (03PS2) 10Cwhite: mtail,profile: add smtp metrics collection with mtail [puppet] - 10https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) [15:10:32] !log enable-puppet in mw* hosts [15:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:20] (03PS3) 10Cwhite: mtail,profile: add smtp metrics collection with mtail [puppet] - 10https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) [15:11:20] PROBLEM - traffic_server backend process restarted on cp5008 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5008&var-layer=backend [15:12:07] (03CR) 10Cwhite: mtail,profile: add smtp metrics collection with mtail (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [15:13:33] I'm going to deploy some stuff to reduce deadlocks on wikibase term store [15:13:37] jynus: ^ [15:16:13] cool, thanks Amir1! 
[15:17:23] (03PS20) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [15:17:44] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [15:17:49] (03PS8) 10Jbond: puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) [15:17:53] (03PS6) 10Jbond: puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) [15:17:55] (03PS1) 10Jbond: puppetmaster1003: remove config for canary puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/547237 (https://phabricator.wikimedia.org/T235655) [15:19:14] hauskater: Hey. [15:19:22] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Gehel) p:05Triage→03High [15:19:47] James_F: so purging the cache did worked [15:19:51] *work [15:20:08] message and linking works okay [15:20:29] ofc the wikistats project isn't working for beta projects but I think that was about to be expected :) [15:20:46] 10Operations, 10Mobile-Content-Service, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 4 others: New Service Request: wikifeeds - https://phabricator.wikimedia.org/T223469 (10akosiaris) 05Open→03Resolved a:03akosiaris This is done. wikifeeds has been deployed for some time now [15:21:20] James_F: some scap warnings on beta though. Not sure if they're important [15:21:37] cf. 
https://integration.wikimedia.org/ci/job/beta-scap-eqiad/273413/console [15:21:55] (03PS21) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [15:22:59] Timeouts occasionally happen, I think it mostly fixes itself. Hopefully. [15:23:44] !log shutting down elastic1039 to be ready for disk swap - T236601 [15:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:49] T236601: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 [15:26:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547237 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [15:26:54] (03PS1) 10Ottomata: Absent eventlogging 'replication' between db1107 and db1108 [puppet] - 10https://gerrit.wikimedia.org/r/547239 (https://phabricator.wikimedia.org/T159170) [15:27:10] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 4 others: Public EventGate endpoint for analytics event intake - https://phabricator.wikimedia.org/T233629 (10akosiaris) [15:27:41] (03CR) 10Herron: [C: 03+1] logstash: drop quotes from filter config options [puppet] - 10https://gerrit.wikimedia.org/r/544217 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [15:27:50] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [15:28:42] (03CR) 10Herron: [C: 03+1] aptrepo: include minor version in elastic 7 repos [puppet] - 10https://gerrit.wikimedia.org/r/547161 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [15:28:53] (03PS22) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [15:29:22] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade 
(exit_code=97) [15:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:51] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 3 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10akosiaris) @Ottomata, I 'd say so. I would ask for review from #Traffic or #ServiceOps, but unless it was under the goals o... [15:30:06] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 4 others: Public EventGate endpoint for analytics event intake - https://phabricator.wikimedia.org/T233629 (10Ottomata) > I'm inclined to just use the existent eventgate-analytics backend endpoint for now. Recent discussions about m... [15:30:08] (03CR) 10Herron: [C: 03+1] puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [15:30:22] (03PS3) 10Giuseppe Lavagetto: Rakefile: add the ability to run fixtures with special values [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (https://phabricator.wikimedia.org/T236899) [15:30:25] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops-radar, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10akosiaris) [15:30:29] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 3 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Ottomata) Ok, I will make patches then. [15:30:35] (03PS1) 10Filippo Giunchedi: Rename wezen to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/547241 (https://phabricator.wikimedia.org/T224564) [15:31:07] (03CR) 10Ottomata: "Unused puppetization will be removed in a subsequent patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/547239 (https://phabricator.wikimedia.org/T159170) (owner: 10Ottomata) [15:31:55] (03CR) 10Herron: [C: 03+1] puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [15:32:01] (03CR) 10Giuseppe Lavagetto: Rakefile: add the ability to run fixtures with special values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (https://phabricator.wikimedia.org/T236899) (owner: 10Giuseppe Lavagetto) [15:32:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: add the ability to run fixtures with special values [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (https://phabricator.wikimedia.org/T236899) (owner: 10Giuseppe Lavagetto) [15:32:40] (03Merged) 10jenkins-bot: Rakefile: add the ability to run fixtures with special values [deployment-charts] - 10https://gerrit.wikimedia.org/r/547179 (https://phabricator.wikimedia.org/T236899) (owner: 10Giuseppe Lavagetto) [15:33:14] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10aaron) >>! In T231086#5601608, @fgiunchedi wrote: > swiftrepl is puppetized now to run an eqiad -> codfw sync once a week on Monday... [15:33:17] 10Operations, 10Patch-For-Review: Add U2F/FIDO as second factor for CAS - https://phabricator.wikimedia.org/T233937 (10MoritzMuehlenhoff) >>! In T233937#5529214, @MoritzMuehlenhoff wrote: > John and I have discussed next steps on IRC: Initially we'll make U2F opt-in via a memberOf/LDAP check. At a later step w... 
[15:33:30] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops-radar, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) I think this is outdated (though see T236797 for a question about interacting with external APIs from Wi... [15:33:37] (03PS1) 10Filippo Giunchedi: rsyslog: setup temporary rsync for logs transfer [puppet] - 10https://gerrit.wikimedia.org/r/547245 (https://phabricator.wikimedia.org/T224564) [15:33:39] (03PS1) 10Filippo Giunchedi: hieradata: remove wezen from service [puppet] - 10https://gerrit.wikimedia.org/r/547246 (https://phabricator.wikimedia.org/T224564) [15:33:40] (03PS1) 10Filippo Giunchedi: Rename wezen to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/547247 (https://phabricator.wikimedia.org/T224564) [15:33:43] (03PS1) 10Filippo Giunchedi: hieradata: pool centrallog2001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/547248 (https://phabricator.wikimedia.org/T224564) [15:33:49] (03PS3) 10Jbond: check_puppetrun: alert critical after 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/546195 [15:35:07] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Release Pipeline, and 3 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) [15:35:09] (03CR) 10Tpt: "Thank you!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/547223 (https://phabricator.wikimedia.org/T236502) (owner: 10Tpt) [15:36:02] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [15:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:17] (03CR) 10Muehlenhoff: Rename wezen to centrallog2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547247 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:38:01] (03CR) 10Cwhite: "PCC checks out now https://puppet-compiler.wmflabs.org/compiler1001/19159/" [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [15:39:27] (03CR) 10Milimetric: "@ArielGlenn, thanks, good point, although this particular data is super duper tiny, like kilobytes I think (does it still matter?)" [puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280) (owner: 10Milimetric) [15:40:45] (03PS2) 10Filippo Giunchedi: Rename wezen to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/547247 (https://phabricator.wikimedia.org/T224564) [15:40:47] (03PS2) 10Filippo Giunchedi: hieradata: pool centrallog2001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/547248 (https://phabricator.wikimedia.org/T224564) [15:41:21] (03CR) 10Filippo Giunchedi: Rename wezen to centrallog2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547247 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:41:47] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547236|Shorten out when there is nothing to clean up]] (T236466) (duration: 01m 05s) [15:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:51] T236466: PHP Warning: [data-update-failed]: A data update callback triggered an exception (Wikimedia\Rdbms\Database::makeList: empty input for field 
wbxl_text_id) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate in /extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php at line 62] - https://phabricator.wikimedia.org/T236466 [15:42:19] (03PS3) 10Andrew Bogott: puppetmaster: Add feature flag for geoip provisioning [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) (owner: 10BryanDavis) [15:43:13] (03PS1) 10Bstorm: kubernetes: allow full opt-out of rsyslog config [puppet] - 10https://gerrit.wikimedia.org/r/547252 [15:43:30] (03CR) 10Cwhite: [C: 03+1] logstash: drop quotes from filter config options [puppet] - 10https://gerrit.wikimedia.org/r/544217 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [15:44:32] (03CR) 10Andrew Bogott: [C: 03+2] puppetmaster: Add feature flag for geoip provisioning [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) (owner: 10BryanDavis) [15:45:34] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: drop quotes from filter config options [puppet] - 10https://gerrit.wikimedia.org/r/544217 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [15:45:45] (03PS2) 10Filippo Giunchedi: logstash: drop quotes from filter config options [puppet] - 10https://gerrit.wikimedia.org/r/544217 (https://phabricator.wikimedia.org/T235891) [15:46:08] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547236|Shorten out when there is nothing to clean up]] (T236466) (duration: 01m 06s) [15:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:40] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [15:47:09] (03CR) 10Andrew Bogott: [C: 03+1] kubernetes: allow full opt-out of rsyslog config [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) 
[15:47:24] (03CR) 10Cwhite: [C: 03+1] aptrepo: include minor version in elastic 7 repos [puppet] - 10https://gerrit.wikimedia.org/r/547161 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [15:48:57] !log roll restart logstash after https://gerrit.wikimedia.org/r/c/operations/puppet/+/544217 [15:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:50] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: include minor version in elastic 7 repos [puppet] - 10https://gerrit.wikimedia.org/r/547161 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [15:50:00] (03PS2) 10Filippo Giunchedi: aptrepo: include minor version in elastic 7 repos [puppet] - 10https://gerrit.wikimedia.org/r/547161 (https://phabricator.wikimedia.org/T234854) [15:54:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547161 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [15:57:37] hauskater: thanks for your patience! still around? [15:57:47] rlazarus: yes sir [15:59:10] (03CR) 10Alexandros Kosiaris: "LGTM, but I do have a question. Will we be able to revert this at some point? 
From what I gathered the new version of toolforge uses it's " [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [15:59:28] so, backstory on my end is T236699 -- I'm working on a new tool called httpbb for testing Apache configs [15:59:29] T236699: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 [15:59:49] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:00:16] the idea is that with a change like https://gerrit.wikimedia.org/r/c/operations/puppet/+/545889 we should be able to add asserts for the new vhost, and confirm that (a) that change did what we expect it to, and (b) everything else still works the way it used to, before deploying more widely [16:00:53] but, this is all a bit future-tense :) the current state is under review in https://gerrit.wikimedia.org/r/c/operations/software/httpbb/+/545689, and we definitely don't have a full suite of production-ready tests yet [16:01:53] aha [16:02:20] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547243|Stop locking and use DISTINCT when finding used terms to delete]] (T236466) (duration: 01m 05s) [16:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:25] T236466: PHP Warning: [data-update-failed]: A data update callback triggered an exception (Wikimedia\Rdbms\Database::makeList: empty input for field wbxl_text_id) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate in /extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php at line 62] - https://phabricator.wikimedia.org/T236466 [16:02:53] (03PS2) 10Bstorm: kubernetes: allow full opt-out of rsyslog config [puppet] - 10https://gerrit.wikimedia.org/r/547252 [16:03:18] so tl;dr, it's much more likely that you can help me test my tool than that I can help you test your change -- but both would involve you waiting around at least a 
little bit for me to catch up with you, and I don't know what your timeline is [16:04:30] cc _joe_ in case you have a master plan [16:05:06] <_joe_> hauskater: what rlazarus said is correct [16:05:15] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547244|Stop locking and use DISTINCT when finding used terms to delete]] (T234948) (duration: 01m 04s) [16:05:20] <_joe_> I wanted to shamelessly use that tiny change to test our new tool :D [16:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:26] T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction - https://phabricator.wikimedia.org/T234948 [16:05:32] <_joe_> but I said so in my comment :P [16:05:48] well, I'm just an amateur 'round this parts. I could help testing but I'd need guidance. In any case, I'd hate the wmf-config patch become unmergeable because of conflicts, etc. And that patch was a lot of time/work on my end [16:06:02] <_joe_> ok yes [16:06:06] <_joe_> so let's merge your change [16:06:20] <_joe_> no more work on your part was required [16:06:35] <_joe_> but if this is needed for the mw-config change, let's get to it [16:06:38] I'd appreciate that. 
We could test with another apache change later if that's okay [16:06:43] <_joe_> sure [16:06:58] we can't create the wiki without the Apache config [16:07:02] <_joe_> we have many to do it with, I just asked if there was no urgency :) [16:07:14] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:07:15] <_joe_> yes, I forgot about that chain of dependencies [16:07:25] (03PS3) 10Bstorm: kubernetes: allow full opt-out of rsyslog config [puppet] - 10https://gerrit.wikimedia.org/r/547252 [16:07:44] and while it's not urgent, operations/mediawiki-config.git is a busy repo and we risk the initial configuration patch becoming useless if it's too conflicted [16:07:55] <_joe_> sure sure got that [16:08:28] (03CR) 10Bstorm: "I am unhappy with the puppet compiler output and fiddling with this a bit." [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:08:30] but please let me know if I can help you with anything else. I feel bad for this. [16:09:21] you're fine, don't worry! appreciate you checking in :) [16:10:00] there'll be an unending supply of guinea pig changes, will happily loop you in if the timing works out when we get there [16:10:21] <_joe_> hauskater: don't feel bad :D [16:10:49] Amir1: what was the time of your deploy? 
I see potentially a slowdown of deadlocks, although to early to say for sure [16:11:13] *too [16:11:26] jynus: yes, let me get it for you, it's in SAL [16:11:47] (03PS7) 10Giuseppe Lavagetto: mediawiki::web:prod_sites.pp: Apache config for ge.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [16:11:52] was it the 16:02+16:05? [16:11:52] jynus: wmf.3 are the parts that matter: https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:03] I see, the 02 [16:12:20] the main one is 02 [16:12:22] the last one [16:12:41] it stops locking millions of rows [16:12:59] it is looking good, but early to say as it is sometimes query-burst related [16:13:13] More info: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/546925 [16:13:14] I don't want to jinx it [16:13:16] :-D [16:13:20] haha sure [16:13:24] thanks for looking into it/working on it [16:13:30] that is super-helpful [16:13:44] (03PS4) 10Bstorm: kubernetes: allow full opt-out of rsyslog config [puppet] - 10https://gerrit.wikimedia.org/r/547252 [16:14:03] _joe_: I did some PCC tests for that dns change using some random hosts and it looks like PCC was happy [16:14:09] I'm staring at numbers non-stop. I found that they are actually nine different issues. I'm trying to address them one by one (my main project has ended, so I'm on term store fixes full time now) [16:14:15] s/dns/apache/ [16:14:23] Coffee please [16:14:54] 10Operations, 10Core Platform Team, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10BBlack) >>! In T226840#5615777, @Ottomata wrote: > It soun... [16:15:14] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:15:16] <_joe_> hauskater: yeah I am showing how to use it to rlazarus :) [16:15:34] :) [16:16:17] I can't use it via integration.wm directly. Looks like I lack permissions, so I had to resort to 'check experimental'. [16:16:57] (03PS5) 10Bstorm: kubernetes: allow full opt-out of rsyslog config [puppet] - 10https://gerrit.wikimedia.org/r/547252 [16:18:22] jynus: oh btw. the maintenance script needs to be restarted to pick it up :/ It will be restarted in one hour or so [16:19:25] (03CR) 10Bstorm: "I do not see how the puppet compiler is confused on this patch, but I can entirely safely try it out in toolforge to be sure before mergin" [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:19:50] Amir1: so it was a good thing I suggested a long time ago to make those get new config every so often [16:20:12] precisely for cases like this [16:21:08] that would be nice [16:22:05] (03CR) 10Paladox: kubernetes: allow full opt-out of rsyslog config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:22:48] (03CR) 10Bstorm: kubernetes: allow full opt-out of rsyslog config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:23:25] (03PS6) 10Bstorm: kubernetes: allow full opt-out of rsyslog config [puppet] - 10https://gerrit.wikimedia.org/r/547252 [16:25:22] (03CR) 10Bstorm: "Ok, thanks to Paladox's finding the issue, now it shows up correctly as a noop https://puppet-compiler.wmflabs.org/compiler1001/19168/kube" [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:26:18] (03CR) 10Bstorm: "So I'm ready to merge if this is cool. 
It can be reverted once our new cluster takes over for all the old nodes (which will be a while ye" [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:32:18] (03CR) 10Bstorm: [C: 03+2] "Going ahead with merge since compiler is now working/clean, and I'll make sure it doesn't change anything on prod nodes after merge." [puppet] - 10https://gerrit.wikimedia.org/r/547252 (owner: 10Bstorm) [16:34:28] (03PS2) 10Ottomata: Sync geoeditors data to dumps and add links [puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280) (owner: 10Milimetric) [16:39:40] (03CR) 10Ottomata: [C: 03+2] Sync geoeditors data to dumps and add links [puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280) (owner: 10Milimetric) [16:43:25] (03CR) 10Nuria: [C: 03+1] "This change would take effect in a bit, let's please make sure to vet files." [puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280) (owner: 10Milimetric) [16:46:12] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10dbarratt) As far as UX is concerned... The HTTP request should **not** happen on page load... 
[16:47:28] (03CR) 10Elukey: [C: 03+1] Absent eventlogging 'replication' between db1107 and db1108 [puppet] - 10https://gerrit.wikimedia.org/r/547239 (https://phabricator.wikimedia.org/T159170) (owner: 10Ottomata) [16:48:01] 10Operations, 10Puppet, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) [16:48:19] 10Operations, 10SRE-tools, 10Patch-For-Review, 10User-jbond: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684 (10jbond) [16:48:51] 10Operations, 10Puppet, 10User-jbond: Audit /etc/apt directories - https://phabricator.wikimedia.org/T214605 (10jbond) [16:49:07] 10Puppet, 10Patch-For-Review, 10User-jbond: Populate puppetdb1002 with live data - https://phabricator.wikimedia.org/T235655 (10jbond) [16:49:18] 10Operations, 10Puppet, 10User-jbond: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (10jbond) [16:50:33] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10User-jbond: PCC always has an ERROR when compiling for servers with profile::redis::slave - https://phabricator.wikimedia.org/T228266 (10jbond) [16:50:49] 10Puppet, 10Patch-For-Review, 10User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) [16:51:11] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10jbond) 05Open→03Resolved [16:51:30] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review, 10User-jbond: upgrade puppet master servers - https://phabricator.wikimedia.org/T227587 (10jbond) [16:51:50] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review, 10User-jbond: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) [16:52:02] 10Operations, 10Puppet, 10User-jbond: missing CRL - https://phabricator.wikimedia.org/T235185 (10jbond) [16:52:27] 10Puppet, 10cloud-services-team, 10Patch-For-Review, 10User-jbond: ensure 
additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) [16:52:51] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Joe) Hi, I assumed the fetching of such data would happen via an async job indeed, upon ima... [16:53:01] 10Operations, 10Puppet, 10User-jbond: reimage of puppet servers can fail - https://phabricator.wikimedia.org/T235067 (10jbond) [16:54:20] Amir1: I may have celebrated too soon, it may be a monitoring artifact, as now instead of errors for deadlocks, there are lock wait timeout exceptions [16:54:45] (03CR) 10RLazarus: [C: 03+2] mediawiki::web:prod_sites.pp: Apache config for ge.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [16:55:25] the edit save time went through the roof after 16:30 [16:55:54] it can be collision of the backports and the script [16:56:06] do you want me to kill it and restart it? [16:56:20] jynus: can we just kill it for a little bit [16:56:30] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Joe) >>! In T236797#5620179, @dbarratt wrote: > As far as UX is concerned... > > The HTTP... [16:56:45] Amir1: where does it run? mwmaint1001? [16:56:57] jynus: mwmaint1002 [16:56:58] 1002 I guess [16:57:35] name? [16:57:46] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10dbarratt) >>! In T236797#5620272, @Joe wrote: > I would assume that running in the client w... [16:58:04] rebuildItemTerms.php? Amir1? 
[16:58:14] jynus: up [16:58:17] *yup [16:59:24] !log killed rebuildItemTerms on mwmaint1002 [16:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:02] 10Operations, 10Patch-For-Review: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (10elukey) `CERGEN=yes DIST=buster-wikimedia pdebuild` is needed to build the package in case anybody needs it :) [17:00:32] you know, remember this is a long term marathon, reverting and trying again is not a loss if for some reason it didn't work [17:00:54] * addshore looks for recovery [17:01:04] otherwise we might have to revert those patches [17:01:36] I may leave soon, so I trust you to do the right thing in any case [17:02:46] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) (owner: 10Bstorm) [17:03:53] addshore: it's recovered now [17:03:56] Amir1: seeing api request rate back [17:04:06] cool [17:04:55] the error will show up again if the script starts again (in 26 minutes) [17:05:11] ping any other root on this channel if you need help/puppet patch [17:05:26] thanks _joe_ and rlazarus for the merge [17:05:27] I make a patch to disable it for now [17:05:47] we should be good to go [17:06:01] <_joe_> hauskater: we're still deploying the change [17:06:17] _joe_: sudo puppet-merge ? :) [17:06:20] <_joe_> it will be fully deployed in ~ 1 hour [17:06:26] oh, wow [17:06:28] one hour [17:06:40] <_joe_> hauskater: puppet runs are staggered by ~ 30 minutes [17:07:02] Amir1: can you link the patch to https://phabricator.wikimedia.org/T236928 ? 
:) [17:07:24] _joe_: I thought you just merged on gerrit and ran puppet-merge on the puppetmaster [17:08:00] <_joe_> hauskater: yes [17:08:15] <_joe_> but before the change is applied puppet needs to run on ~ 200 hosts [17:08:35] (03PS1) 10Ladsgroup: mediawiki: Temporary disable rebuildTermIndex [puppet] - 10https://gerrit.wikimedia.org/r/547264 (https://phabricator.wikimedia.org/T236928) [17:08:37] <_joe_> and I'm teaching rlazarus how to check the correctness of the change [17:08:44] too many hosts, let's just shout them down ;) [17:08:48] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Mholloway) The HTTP requests for labels happen asynchronously in a deferred update on uploa... [17:08:48] *shut [17:08:50] yep, it's my first time doing this, so it's a little slower than usual :) [17:09:06] no urgency here [17:09:09] take your time [17:09:13] * hauskater learning things [17:09:28] and again, thanks for your help [17:11:41] _joe_: hey, can you merge this and run puppet agent in mwmaint1002? https://gerrit.wikimedia.org/r/547264 Things explode if this doesn't happen in 19 minutes :( [17:11:52] <_joe_> Amir1: uh but [17:12:00] <_joe_> mwmaint1002 doesn't have that apache config [17:12:45] Jaime just killed the cronjob in that node. [17:12:48] * Amir1 is confused now [17:13:02] <_joe_> yes the cronjob runs there [17:13:12] <_joe_> but I don't know how that can relate to the apache config on that node [17:13:15] Amir1: im guessing the script is still not running now? [17:13:36] <_joe_> is there a script I need to disable for the time being? 
[17:13:38] <_joe_> a cron I mean [17:14:15] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10Dzahn) Could you please merge/amend/remove the missing cumin ali... [17:14:27] (03CR) 10Jcrespo: [C: 03+1] mediawiki: Temporary disable rebuildTermIndex [puppet] - 10https://gerrit.wikimedia.org/r/547264 (https://phabricator.wikimedia.org/T236928) (owner: 10Ladsgroup) [17:14:44] addshore: yes, because it's killed but cron will start it every :30 [17:14:50] Amir, _joe_ is busy with an unrelated apache change [17:15:07] please someone else merge that, I know it is necessary and +1'ed it already [17:15:25] <_joe_> I can merge it in a sec [17:15:32] <_joe_> it's rlazarus who's working on the change [17:15:49] _joe_: I see, sorry for the confusion, it's just semi-urgent, I saw Joe is online [17:16:14] Amir1: I think we need to revert the 2 patches too [17:17:16] addshore: I can't find errors in logstash, the graph looks good too [17:17:20] let me grab links [17:17:35] https://logstash.wikimedia.org/goto/16ac7c263fef3038e0e356d71e62969e [17:17:38] https://usercontent.irccloud-cdn.com/file/juZfBqJt/image.png [17:17:40] Amir1: ^^ [17:17:45] https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=wbsetlabel&panelId=19&fullscreen&from=now-1h&to=now [17:18:06] the script killed edit rate, we stopped the script, edit rate slowly came back, before getting to the top, the edit rate alone caused problems [17:18:24] just the edit rate needed time to recover to show it is also a problem it seems? 
[17:18:26] maybe [17:18:34] that is https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&from=now-3h&to=now&var-metric=p95&var-module=wbsetlabel [17:18:39] <_joe_> sorry so the two problems are unrelated, right jynus [17:18:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Temporary disable rebuildTermIndex [puppet] - 10https://gerrit.wikimedia.org/r/547264 (https://phabricator.wikimedia.org/T236928) (owner: 10Ladsgroup) [17:19:16] I mean this is current wikidata edit rate, so something isn't right https://usercontent.irccloud-cdn.com/file/WmbqNQZT/image.png [17:19:16] let me check one last thing, is it a wayward bot or not [17:19:32] DBs just catching up? [17:19:40] that peak is when we stopped the script [17:20:03] <_joe_> Amir1, addshore puppet running on mwmaint1002 [17:20:09] thanks [17:20:41] that's very very low for edits/sec [17:20:47] James_F: we also have 5-8 seconds maxlag on the wikidata dbs currently it seems [17:20:58] apergos: just wikidata edits though [17:21:02] I just checked wikipulse and it really is an order of magnitude too small [17:21:04] yeah wikidata [17:21:18] db lag still bouncing around 5-15s ? [17:21:39] okay, let's revert then [17:21:39] <_joe_> Notice: /Stage[main]/Mediawiki::Maintenance::Wikidata/Cron[wikidata-rebuildItemTerms]/ensure: removed [17:21:44] Amir1: yup [17:23:21] db slave lag even going up to 20+s now :( [17:23:32] 10Operations, 10Puppet, 10User-jbond: Investigate using the rich_data opsion to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [17:23:50] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280) (owner: 10Milimetric) [17:23:51] and edit rate down to almost nothing, probably a bunch of bots respecting replag [17:24:02] yup :) i regularly check [17:24:06] and yell [17:24:11] 10Operations, 10Puppet, 10observability, 10User-jbond: update failed puppet checks so that they go critical 24 hours - https://phabricator.wikimedia.org/T236478 (10jbond) [17:24:30] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10User-jbond: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10jbond) [17:24:47] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) [17:25:45] addshore: syncing [17:25:47] 10Operations, 10Puppet, 10Traffic, 10Patch-For-Review, 10User-jbond: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (10jbond) [17:26:12] 10Operations, 10Puppet, 10netops, 10User-jbond: Investigate improvements to how puppet manages interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) [17:26:14] (03PS3) 10Dzahn: Design microsite: Set the scap deploy_user to "deploy-design" [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [17:26:30] 10Operations, 10netops, 10User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10jbond) [17:26:36] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s) [17:26:38] Seems that the replication lag was just on db1126 [17:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:43] T236928: Wikidata editing via API (and UI) slow (timeouts) - https://phabricator.wikimedia.org/T236928 [17:26:56] according to 
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104 it shot up to 25 mins for a minute or 2 [17:27:14] 10Operations, 10Cassandra, 10User-jbond: Create a cassandra.service which subsumes casandra-{a,b,c} services using PartsOf=cassandra.service - https://phabricator.wikimedia.org/T229916 (10jbond) [17:27:37] 10Operations, 10ops-eqiad, 10cloud-services-team, 10User-jbond: (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10jbond) [17:27:41] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Joe) >>! In T236797#5620320, @Mholloway wrote: > The HTTP requests for labels happen asynch... [17:28:02] 10Operations, 10Puppet, 10Packaging: update puppetdb and puppet-master packages to be compatible with puppet5 - https://phabricator.wikimedia.org/T222879 (10jbond) 05Open→03Resolved [17:28:07] twentyafterfour: i'll go ahead with the deploy-design switch [17:28:32] 10Operations, 10User-jbond: Discussion: Explore push notifications options - https://phabricator.wikimedia.org/T221265 (10jbond) [17:28:48] 10Operations, 10Puppet: Create canary roles for all canaries - https://phabricator.wikimedia.org/T221226 (10jbond) 05Open→03Resolved [17:29:03] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Anomie) a:05Anomie→03None I'm not actively working on... 
[17:29:06] 10Operations, 10Puppet, 10User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond) [17:29:07] Looks like both the rep lag have recovered and edit rate is shooting up [17:29:10] looks like we are back to normal, wikidata edits/min completely redlined (> 330) [17:29:14] (03CR) 10Dzahn: [C: 03+2] Design microsite: Set the scap deploy_user to "deploy-design" [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [17:29:15] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s) [17:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:27] Can we create alarms and things in grafana yet? (I've been away) [17:29:51] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review, 10User-jbond: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) Just waiting for the puppetdb's to get upgraded [17:30:01] now hovering around 270/min, still reasonable [17:30:47] 10Operations, 10DC-Ops, 10hardware-requests: Request spare systems to test ipmi password reset cookbook - https://phabricator.wikimedia.org/T218117 (10jbond) 05Open→03Resolved no longer required as reset has been tested in production [17:30:54] (03CR) 10Dzahn: "Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/deploy-service]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [17:31:37] edit rate should slowly recover over the next 5 mins i would guess, as long as we don't see an increase in api execution time we should be all fine [17:31:58] 10Operations, 10Continuous-Integration-Config, 10User-jbond: operations/puppet CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10jbond) [17:32:04] (03CR) 10Dzahn: "puppet run looking good on bromine/vega. deploy-service key removed. deploy-design user/group/key created." 
[puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [17:32:18] 10Operations, 10Puppet, 10Traffic, 10Patch-For-Review, 10User-jbond: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (10BBlack) ~15m delays should be ok for the GeoIP stuff, it was already sync'd to various consuming cache and DNS nodes over the ~30 minute splay window of p... [17:33:28] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1572447662000&to=1572456797439&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-shard=s8&var-role=All [17:33:42] interesting [17:33:58] so that is indeed from the moment the maint script restarted [17:34:35] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Mholloway) @joe That sounds good, thanks. I'll update the code accordingly. [17:34:40] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1572447662000&to=1572456872649&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-shard=s8&var-role=All [17:34:52] (03PS2) 10Bstorm: toolforge-k8s: enable the settings API and PodPreset [puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) [17:35:03] added that to the ticket [17:35:51] (03CR) 10Bstorm: "Checking the existing configmap, it doesn't have quotes around a comma-separated list. Trying to mimic that style just in case that is use" [puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) (owner: 10Bstorm) [17:36:56] (03CR) 10Bstorm: "I'll edit this into the config map on the cluster now and see if we can validate it works live, then merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) (owner: 10Bstorm) [17:37:01] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [17:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:04] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 04s) [17:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:16] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide, 10Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Dzahn) I created a new keypair for deployment (ke... [17:38:42] (03CR) 10Dzahn: "corresponding ticket https://phabricator.wikimedia.org/T235677" [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [17:39:04] interesting all of those reads seemed to only hit that one db [17:40:24] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide, 10Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10mmodell) heh amazing timing! I just tried it and... [17:41:16] 10Operations, 10SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Dzahn) The 3 users exist on the deployment server and are members of the new group: ` [deploy1001:~] $ id volker-e uid=12186(volker-... [17:41:23] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? 
- https://phabricator.wikimedia.org/T236797 (10Joe) Basically what you need is: - a setting for the domains to exclude from proxying to -... [17:41:23] Interestingly, the scap of design/style-guide failed but the channel log doesn't appear to reflect that it was a fail [17:41:35] !log run reprepro clearvanished on install1002 to clean leftovers of buster-wikimedia|thirdparty/elastic7 [17:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:17] Amir1: still seeing some increased execution times on https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&from=now-2h&to=now&var-metric=p95&var-module=wbeditentity but not sure if that is just because the requests started ages ago and are no just finishing xD [17:43:07] !log upload cergen 0.2.5-1+deb10u1 to buster-wikimedia component/cergen [17:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:19] addshore: don’t we terminate requests after 60 s? [17:43:57] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide, 10Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Dzahn) Once deployment has been confirmed keep in... [17:43:58] hm [17:44:55] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide, 10Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Dzahn) >>! In T235677#5620454, @mmodell wrote: >... [17:45:30] (03CR) 10Bstorm: "Turns out if you edit the manifest at /etc/kubernetes/manifests/api-server.yaml, it automatically restarts and includes your new config. 
" [puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) (owner: 10Bstorm) [17:46:49] I can't push my reverts to gerrit :( [17:46:58] D: [17:47:05] pushing things from production node :( [17:47:15] on branch [17:47:16] <_joe_> Amir1: what's going on? [17:47:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10Dzahn) 05Open→03Stalled [17:47:35] <_joe_> with gerrit I mean [17:47:49] 10Operations, 10SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Dzahn) 05Open→03Resolved [17:47:53] _joe_: I'm trying to push it in gerrit using the process mentioned in https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Reverting [17:47:56] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide, 10Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Dzahn) [17:47:59] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: enable the settings API and PodPreset [puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) (owner: 10Bstorm) [17:48:15] <_joe_> and what is going wrong with it? [17:48:21] but first it doesn't get to branch and now it complains for lack of change id [17:48:26] (03CR) 10Bstorm: "I will update docs on how to do that." [puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678) (owner: 10Bstorm) [17:48:40] <_joe_> Amir1: what is the change you want to revert? [17:48:58] 10Operations, 10Patch-For-Review: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (10Ottomata) Huh, can you add that to the debian/README? 
[17:49:01] <_joe_> can I see what you were able to submit? [17:49:08] sure [17:49:16] it's in wikibase extension in wmf.3 and wmf.4 [17:49:56] when I run gitdir=$(git rev-parse --git-dir); scp -p -P 29418 ladsgroup@gerrit.wikimedia.org:hooks/commit-msg ${gitdir}/hooks/, it fails because my gerrit ssh key is not in the production node [17:49:57] (03PS2) 10Ottomata: Absent eventlogging 'replication' between db1107 and db1108 [puppet] - 10https://gerrit.wikimedia.org/r/547239 (https://phabricator.wikimedia.org/T159170) [17:50:10] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19170/db1108.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/547239 (https://phabricator.wikimedia.org/T159170) (owner: 10Ottomata) [17:50:43] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Absent eventlogging 'replication' between db1107 and db1108 [puppet] - 10https://gerrit.wikimedia.org/r/547239 (https://phabricator.wikimedia.org/T159170) (owner: 10Ottomata) [17:51:27] I just revert it in gerrit and then I use its change -Id [17:51:32] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: Cleanup the puppetmaster module so that we stop breaking expectations (and the puppet compiler) - https://phabricator.wikimedia.org/T211547 (10jbond) [17:51:45] <_joe_> Amir1: https://gerrit.wikimedia.org/r/tools/hooks/commit-msg [17:52:04] <_joe_> copy this file to the hooks dir of your repo [17:52:09] 10Operations, 10User-jbond: Upgrade CAS to 6.1.0 - https://phabricator.wikimedia.org/T236815 (10jbond) [17:52:24] 10Operations, 10Icinga, 10observability, 10User-jbond: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707 (10jbond) [17:52:37] <_joe_> anyways, please go on doing what you need to do [17:52:43] 10Operations, 10DC-Ops, 10decommission, 10User-jbond: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10jbond) [17:52:58] _joe_: nice, I can do that too, like wget [17:53:29] 10Operations, 10ops-codfw: No microcode updates 
loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10jbond) Is this fixed now? [17:53:40] 10Operations, 10ops-codfw, 10User-jbond: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10jbond) [17:53:51] 10Operations, 10User-jbond: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (10jbond) [17:54:02] 10Operations, 10User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10jbond) [17:54:18] 10Operations, 10User-jbond: Systemd hardening of CAS service unit - https://phabricator.wikimedia.org/T233951 (10jbond) [17:54:26] 10Operations, 10User-jbond: Revisit Tomcat deployment of CAS - https://phabricator.wikimedia.org/T233950 (10jbond) [17:54:34] 10Operations, 10User-jbond: Fine-tune CAS logging - https://phabricator.wikimedia.org/T233949 (10jbond) [17:54:41] 10Operations, 10User-jbond: Review ticket policies - https://phabricator.wikimedia.org/T233948 (10jbond) [17:54:52] 10Operations, 10User-jbond: CAS build as a deb - https://phabricator.wikimedia.org/T233947 (10jbond) [17:55:02] 10Operations, 10User-jbond: Validate user lockout - https://phabricator.wikimedia.org/T233946 (10jbond) [17:55:11] 10Operations, 10User-jbond: Banning IPs / subnets from accessing login/validation endpoint - https://phabricator.wikimedia.org/T233945 (10jbond) [17:55:21] 10Operations, 10User-jbond: Log / alert on too many failing logins / Throttling login attempts - https://phabricator.wikimedia.org/T233944 (10jbond) [17:55:32] 10Operations, 10User-jbond: Maintain session history / audit log - https://phabricator.wikimedia.org/T233942 (10jbond) [17:55:41] 10Operations, 10User-jbond: Validate Single Logout Flow - https://phabricator.wikimedia.org/T233941 (10jbond) [17:55:51] 10Operations, 10User-jbond: CLI tools for CAS administration - https://phabricator.wikimedia.org/T233940 
(10jbond) [17:55:57] 10Operations, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) [17:56:08] 10Operations, 10User-jbond: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938 (10jbond) [17:56:12] 10Operations, 10Patch-For-Review: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (10elukey) >>! In T235405#5620530, @Ottomata wrote: > Huh, can you add that to the debian/README? Yep yep already planning to, will do it tomorrow! [17:56:24] (03CR) 10Jgreen: [C: 03+1] DNS: Remove production and mgmt DNS for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/544279 (https://phabricator.wikimedia.org/T222109) (owner: 10Papaul) [17:56:26] 10Operations, 10Patch-For-Review, 10User-jbond: Add U2F/FIDO as second factor for CAS - https://phabricator.wikimedia.org/T233937 (10jbond) [17:56:31] 10Operations, 10User-jbond: Integrate CAS into backup infrastructure - https://phabricator.wikimedia.org/T233936 (10jbond) [17:56:33] (03PS3) 10Jgreen: DNS: Remove production and mgmt DNS for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/544279 (https://phabricator.wikimedia.org/T222109) (owner: 10Papaul) [17:56:42] 10Operations, 10User-jbond: Icinga Monitoring for CAS - https://phabricator.wikimedia.org/T233935 (10jbond) [17:56:50] 10Operations, 10User-jbond: Collects metrics for CAS - https://phabricator.wikimedia.org/T233934 (10jbond) [17:56:59] 10Operations, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) [17:57:09] 10Operations, 10User-jbond: Cross data center setup for CAS - https://phabricator.wikimedia.org/T233931 (10jbond) [17:57:16] 10Operations, 10User-jbond: Create a staging environment for CAS - https://phabricator.wikimedia.org/T233930 (10jbond) [17:57:25] 10Operations, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond) [17:57:39] 10Operations, 10puppet-compiler, 10User-jbond, 
10User-jijiki: Remove nginx submodule from puppet - https://phabricator.wikimedia.org/T230206 (10jbond) [17:57:49] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10jbond) [17:57:55] 10Operations, 10Puppet, 10observability, 10User-jbond: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (10jbond) [17:58:11] 10Operations, 10Analytics, 10Traffic, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10jbond) [17:58:18] 10Operations, 10LDAP, 10User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10jbond) [17:58:32] (03CR) 10Jgreen: [C: 03+2] DNS: Remove production and mgmt DNS for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/544279 (https://phabricator.wikimedia.org/T222109) (owner: 10Papaul) [17:58:43] 10Operations, 10puppet-compiler, 10User-jbond: puppet-catalog-compiler: compilation result randomly places servers in the 'failed' section - https://phabricator.wikimedia.org/T224977 (10jbond) [17:58:52] (03PS1) 10Ladsgroup: Revert "mediawiki: Temporary disable rebuildTermIndex" [puppet] - 10https://gerrit.wikimedia.org/r/547271 [17:59:13] (03PS2) 10Ladsgroup: Revert "mediawiki: Temporary disable rebuildTermIndex" [puppet] - 10https://gerrit.wikimedia.org/r/547271 [17:59:15] 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10jbond) [17:59:37] 10Operations, 10User-jbond: Audit our infrastructure for authenticated services - https://phabricator.wikimedia.org/T220361 (10jbond) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T1800). [18:00:04] urandom and Amir1: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] o/ [18:00:40] 10Operations, 10puppet-compiler, 10User-jbond: puppet: compiler-update-facts error and warning - https://phabricator.wikimedia.org/T214472 (10jbond) [18:00:45] ignore mine [18:01:22] 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) [18:01:24] (03PS1) 10Ottomata: Remove now unused eventlogging::replica puppetization [puppet] - 10https://gerrit.wikimedia.org/r/547273 (https://phabricator.wikimedia.org/T159170) [18:01:31] 10Operations, 10SRE-tools, 10Patch-For-Review, 10User-jbond, 10cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (10jbond) [18:03:08] 10Operations, 10Traffic, 10netops, 10IPv6, and 2 others: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) [18:03:27] 10Operations, 10Puppet, 10User-jbond: Add a CI check for the use of hiera() function - https://phabricator.wikimedia.org/T220820 (10jbond) [18:04:08] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10jbond) [18:04:19] 10Puppet, 10User-jbond, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10jbond) [18:05:47] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? 
- https://phabricator.wikimedia.org/T236797 (10Mholloway) While investigating this yesterday I found that the Google client library that o... [18:07:23] (03PS2) 10Ottomata: Remove now unused eventlogging::replica puppetization [puppet] - 10https://gerrit.wikimedia.org/r/547273 (https://phabricator.wikimedia.org/T159170) [18:11:21] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Joe) So after some quick grepping, we already define a proxy in `mediawiki-config`, and it... [18:14:19] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kaldari) >After filtering out *wiktionary ns0 and ns1, loo... [18:15:00] (03CR) 10Dzahn: [C: 03+2] gerrit: remove cobalt from ssh known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/545335 (https://phabricator.wikimedia.org/T236187) (owner: 10Dzahn) [18:15:08] (03PS2) 10Dzahn: gerrit: remove cobalt from ssh known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/545335 (https://phabricator.wikimedia.org/T236187) [18:17:47] hauskater: belatedly -- your Apache change should be everywhere, shout if you still need anything [18:18:11] rlazarus: I appreciate your assistance, and that of _joe_ as well :) [18:20:43] MaxSem, RoanKattouw, Niharika, Urbanecm: will there be a SWAT? 
[18:21:35] jouncebot: next [18:21:35] In 0 hour(s) and 38 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T1900) [18:21:42] jouncebot: now [18:21:43] For the next 0 hour(s) and 38 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T1800) [18:22:13] urandom: apparently we're at it but so far no deployer is around? [18:22:25] hauskater: apparently [18:23:58] * hauskater eyes Urbanecm [18:25:10] (03CR) 10Ottomata: [C: 03+2] "Adding Jaime for FYI, yeehaw! eventlogging_sync is disabled! :)" [puppet] - 10https://gerrit.wikimedia.org/r/547273 (https://phabricator.wikimedia.org/T159170) (owner: 10Ottomata) [18:25:30] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/19172/db1108.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/547273 (https://phabricator.wikimedia.org/T159170) (owner: 10Ottomata) [18:25:47] urandom: Ain't you able to self-deploy? [18:26:28] Amir1: hi, wondering if you can SWAT a patch for urandom? [18:26:30] Hey, I can SWAT. [18:26:42] awesome [18:26:47] James_F: thanks! [18:27:05] (03PS3) 10Jforrester: Migrate to Kask for Echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547022 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [18:27:24] (03CR) 10Jforrester: [C: 03+2] Migrate to Kask for Echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547022 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [18:28:15] (03Merged) 10jenkins-bot: Migrate to Kask for Echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547022 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [18:28:50] urandom: Live on mwdebug1001. Can you test? [18:28:55] James_F: yup! 
[18:31:30] James_F: looks good [18:31:37] 10Operations, 10observability, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Addshore) o/ From the WMDE side we would love to be able to set up more alerts for more things. grafana could be a great place for this, We actually just set... [18:31:45] Kk. [18:32:57] (03PS1) 10Bstorm: labsaliaser: switch to systemd timer and use root [puppet] - 10https://gerrit.wikimedia.org/r/547277 [18:33:09] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T222851 Migrate to Kask for Echo seen-time storage (duration: 01m 01s) [18:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:18] T222851: Improve Echo seentime code for multi-DC access - https://phabricator.wikimedia.org/T222851 [18:33:28] OK, anything else for SWAT? [18:34:08] not for me [18:34:38] Amir1's patches seem to have been UBN-merged and deployed and UUUBN reverted? [18:37:21] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [18:37:51] (03PS2) 10Bstorm: labsaliaser: switch to systemd timer and use root [puppet] - 10https://gerrit.wikimedia.org/r/547277 [18:37:53] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/545094 (owner: 10Dzahn) [18:38:52] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) RT (requesttracker) moved from jessie and public IP (ununpentium) to buster and private IP (moscovium) and https to backend via https://rt.discovery.wmnet [18:39:03] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [18:41:04] 10Operations, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10WDoranWMF) [18:41:08] James_F: yup [18:41:38] (03CR) 10Bstorm: "Seems legit: https://puppet-compiler.wmflabs.org/compiler1002/19173/cloudservices1003.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/547277 (owner: 10Bstorm) [18:42:00] (03PS2) 10Dzahn: acme_chief: remove cobalt from authorized hosts [puppet] - 10https://gerrit.wikimedia.org/r/545334 (https://phabricator.wikimedia.org/T236187) [18:42:08] Kk. [18:42:18] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [18:42:54] (03CR) 10Dzahn: [C: 03+2] acme_chief: remove cobalt from authorized hosts [puppet] - 10https://gerrit.wikimedia.org/r/545334 (https://phabricator.wikimedia.org/T236187) (owner: 10Dzahn) [18:44:00] (03PS2) 10Dzahn: ci: remove cobalt from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/545330 (https://phabricator.wikimedia.org/T236187) [18:44:24] (03PS8) 10Jforrester: Variant configuration: Allow for YAML-based inheritance of configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) [18:44:25] (03PS2) 10Jforrester: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) [18:45:36] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:45:54] (03PS3) 10Bstorm: labsaliaser: switch to systemd timer and use root [puppet] - 10https://gerrit.wikimedia.org/r/547277 [18:46:30] (03CR) 10Dzahn: [C: 03+2] ci: remove cobalt from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/545330 (https://phabricator.wikimedia.org/T236187) (owner: 10Dzahn) [18:46:53] (03CR) 10Jforrester: Variant configuration: Generate dblists from YAML (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:47:31] UUUBN lol [18:48:34] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): geoipupdate missing on buster on Cloud VPS - https://phabricator.wikimedia.org/T236487 (10bd808) 05Open→03Resolved p:05Triage→03Normal a:03bd808 Fixed by updating Puppet configuration to exclude geoip package provisioning by default for `::role... [18:48:55] (03CR) 10Bstorm: "Going to merge and try it out." 
[puppet] - 10https://gerrit.wikimedia.org/r/547277 (owner: 10Bstorm) [18:48:57] (03CR) 10Bstorm: [C: 03+2] labsaliaser: switch to systemd timer and use root [puppet] - 10https://gerrit.wikimedia.org/r/547277 (owner: 10Bstorm) [18:50:43] (03CR) 10Jforrester: [C: 03+1] "What's left to do here?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542658 (owner: 10Krinkle) [18:51:14] (03PS2) 10Dzahn: mariadb: remove cobalt from ferm_misc rules [puppet] - 10https://gerrit.wikimedia.org/r/545333 (https://phabricator.wikimedia.org/T236187) [18:51:43] (03CR) 10Jcrespo: "I need to check, but this may break check_private_data on labs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:52:55] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:56:14] (03CR) 10Jcrespo: "yeah, operations/puppet:modules/role/files/mariadb/check_private_data.py needs a minor, non-blocking patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:57:55] Jdlrobson: Hi. Do you still need Netlify for MinervaNeue and extension-PopUps @ github ? [19:00:04] brennen and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T1900). 
[19:00:05] (03CR) 10Jcrespo: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:02:21] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547279 [19:02:23] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547279 (owner: 10Brennen Bearnes) [19:03:29] (03PS1) 10Bstorm: labsaliaser: wrap up the systemd timer command in a script [puppet] - 10https://gerrit.wikimedia.org/r/547280 [19:03:32] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547279 (owner: 10Brennen Bearnes) [19:05:13] !log moscovium - stop and remove rsync server, purge rsync package T180641 [19:05:14] (03PS2) 10Bstorm: labsaliaser: wrap up the systemd timer command in a script [puppet] - 10https://gerrit.wikimedia.org/r/547280 [19:05:16] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.4 [19:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:20] T180641: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 [19:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:10] (03PS1) 10Ottomata: Optimize archiva-gitfat-link script [puppet] - 10https://gerrit.wikimedia.org/r/547281 (https://phabricator.wikimedia.org/T235668) [19:06:17] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.4 (duration: 01m 00s) [19:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:25] (03CR) 10Dzahn: [C: 03+2] cumin: update which server is the kafka-main canary [puppet] - 10https://gerrit.wikimedia.org/r/545094 (owner: 10Dzahn) [19:07:33] (03PS3) 10Dzahn: cumin: update which server is the kafka-main canary [puppet] - 
10https://gerrit.wikimedia.org/r/545094 [19:08:34] (03CR) 10Ottomata: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/547281 should be better!" [puppet] - 10https://gerrit.wikimedia.org/r/541775 (owner: 10Alexandros Kosiaris) [19:09:58] 10Operations, 10serviceops, 10Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) [19:10:01] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) [19:10:04] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [19:11:31] (03CR) 10Bstorm: [C: 03+2] "merging this to make the last patch actually work." [puppet] - 10https://gerrit.wikimedia.org/r/547280 (owner: 10Bstorm) [19:11:39] (03CR) 10BryanDavis: [C: 03+1] labsaliaser: wrap up the systemd timer command in a script [puppet] - 10https://gerrit.wikimedia.org/r/547280 (owner: 10Bstorm) [19:12:24] (03PS1) 10Jcrespo: check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) [19:13:53] (03CR) 10Jcrespo: "It shouldn't be a difficult patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/547283" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:14:28] (03CR) 10jerkins-bot: [V: 04-1] check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [19:16:01] (03PS1) 10Halfak: Adds hunspell-eu to ores/manifests/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/547285 [19:16:24] (03PS2) 10Jcrespo: check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) [19:16:41] (03PS2) 10Halfak: 
Adds hunspell-eu to ores/manifests/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/547285 (https://phabricator.wikimedia.org/T223788) [19:16:57] (03CR) 10Jcrespo: "I thought lstrip was a global function? Mixing PHP maybe?" [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [19:25:39] (03PS1) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [19:26:15] (03CR) 10jerkins-bot: [V: 04-1] labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 (owner: 10Bstorm) [19:27:10] 10Operations, 10observability, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Krinkle) This task is a placeholder for improving the current situation. It is not a blocker. Various teams at WMF use Grafana for alerting and I encourage WM... [19:31:12] (03PS2) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [19:32:08] Shouldn't the security options in the article be baked into https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_SSH_config ? 
[19:32:35] also greg-g, your example config seems outdated [19:32:58] probably :) [19:33:16] (03PS10) 10Joal: Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [19:33:43] greg-g: It was very helpful for SSH config noobs like myself [19:35:30] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [19:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:58] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [19:42:45] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Anomie) >>! In T219279#5620704, @kaldari wrote: >>After fi... [19:42:45] hauskater: "Jdlrobson: Hi. Do you still need Netlify for MinervaNeue and extension-PopUps @ github ?" nope. although it would be hella useful from a development pov we have other ways (albeit a little more cumbersome) of publishing static sites from repos. [19:43:12] Jdlrobson: So, shall I reject the request to have that App installed then? [19:43:19] Just to be sure :) [19:45:59] Umh, I think Netlify was also available for MobileFrontend Jdlrobson [19:46:06] was it needed there?
[19:49:02] 10Operations: Add docker-engine to buster-wikimedia distribution - https://phabricator.wikimedia.org/T236947 (10Ottomata) [19:49:29] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:50:00] (03CR) 10Jforrester: [C: 03+1] check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [19:50:11] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:52:07] 10Operations: Add docker-engine to buster-wikimedia distribution - https://phabricator.wikimedia.org/T236947 (10Ottomata) Hm, on second thought, perhaps the Jessie package doesn't work in buster after all... ` Oct 30 19:51:32 deployment-eventgate-2 docker[15339]: /usr/bin/docker: Error response from daemon: inv... [19:52:51] (03PS2) 10Jforrester: Enable WebAuthn on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546975 (https://phabricator.wikimedia.org/T227242) (owner: 10Reedy) [19:52:59] (03PS3) 10Jforrester: Enable WebAuthn on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546975 (https://phabricator.wikimedia.org/T227242) (owner: 10Reedy) [19:53:29] James_F: You're not planning on deploying that, right? 
[19:54:15] (missing vendor patch being merged due to CI dependencies) [19:58:09] PROBLEM - Check the last execution of labs-ip-alias-dump on cloudservices1004 is CRITICAL: CRITICAL: Status of the systemd unit labs-ip-alias-dump https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:58:33] (03PS2) 10Jforrester: Remove unused $hostName variable in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546641 (owner: 10Krinkle) [19:59:11] (03PS1) 10Dzahn: admins: add twentyafterfour to design deployers [puppet] - 10https://gerrit.wikimedia.org/r/547290 (https://phabricator.wikimedia.org/T236518) [19:59:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani, Mukunda Modell - https://phabricator.wikimedia.org/T236518 (10Dzahn) [20:00:04] (03CR) 10Dzahn: [C: 03+2] admins: add twentyafterfour to design deployers [puppet] - 10https://gerrit.wikimedia.org/r/547290 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [20:00:04] cscott, arlolra, subbu, halfak, accraze, and mdholloway: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T2000). [20:00:05] PROBLEM - Check the last execution of labs-ip-alias-dump on cloudservices1003 is CRITICAL: CRITICAL: Status of the systemd unit labs-ip-alias-dump https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:03:35] Hey folks. I have a simple puppet change. Does anyone have time to take a look?
https://gerrit.wikimedia.org/r/547285 [20:03:52] I'm hoping to get my kevinbazira unblocked when he picks up work tomorrow morning (UTC+3) [20:03:56] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:01] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 05s) [20:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:22] twentyafterfour: done on deploy1001 [20:04:37] twentyafterfour: next is your other change called "2/2" though [20:04:49] (03PS1) 10Ottomata: Update deployment-eventgate host for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547291 [20:05:41] (03CR) 10Dzahn: [C: 03+2] Adds hunspell-eu to ores/manifests/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/547285 (https://phabricator.wikimedia.org/T223788) (owner: 10Halfak) [20:05:56] halfak: done [20:05:58] <3! 
[20:06:01] Thank you [20:06:19] (03PS3) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [20:06:27] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:50] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 23s) [20:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:19] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:27] !log twentyafterfour@deploy1001 deploy aborted: (no justification provided) (duration: 00m 07s) [20:07:28] 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on analytics1049 - https://phabricator.wikimedia.org/T234785 (10wiki_willy) Thanks @elukey I'll close out this request, if all the alerting is suppressed now. [20:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:34] 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on analytics1049 - https://phabricator.wikimedia.org/T234785 (10wiki_willy) 05Open→03Resolved [20:07:53] halfak: yw. the access request for Kevin Bazira also looks resolved. nice to see that happen before their first day, it's uncommon [20:08:16] but how it should be [20:08:20] mutante: which keyholder_key should I use? I don't see a deploy_design.yml in /etc/keyholder-auth.d [20:08:21] Oh. Actually kevin has been around for about a month :) [20:08:27] :\ [20:08:51] halfak: oh, i misunderstood it as first day of work, gotcha [20:08:52] (03PS1) 10Mholloway: WikimediaEditorTasks: Enable edit streaks on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547292 [20:09:15] Aha! Still a good thought! 
[20:09:34] (03CR) 10jerkins-bot: [V: 04-1] WikimediaEditorTasks: Enable edit streaks on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547292 (owner: 10Mholloway) [20:09:56] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:47] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 51s) [20:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:00] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:11:03] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 03s) [20:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:20] (03PS2) 10Mholloway: WikimediaEditorTasks: Enable edit streaks on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547292 [20:11:26] (03CR) 10BryanDavis: [C: 03+1] "Should work well for skipping full line comments" [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [20:12:03] 10Operations, 10Icinga: dwisehaupt needs access to iginca for frack hosts - https://phabricator.wikimedia.org/T235676 (10Jgreen) [20:12:19] 10Operations, 10ops-eqiad: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10wiki_willy) a:05ArielGlenn→03Cmjohnson Reassigning to @Cmjohnson for Ariel's RAID question [20:13:04] (03PS1) 10Alex Monk: toolforge::proxy: Remove old star.wmflabs.org absent resource [puppet] - 10https://gerrit.wikimedia.org/r/547293 [20:13:55] twentyafterfour: i am not sure yet how it gets to deploy1001. 
it has been created on bromine as File[/etc/ssh/userkeys/deploy-design [20:14:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10wiki_willy) @jijiki - just following up to see if this is still an issue or if we can resolve this. Thanks, Willy [20:14:33] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Enable edit streaks on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547292 (owner: 10Mholloway) [20:15:20] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Enable edit streaks on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547292 (owner: 10Mholloway) [20:15:50] 10Operations, 10ops-eqiad: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10ArielGlenn) Why was this assigned to me (which I didn't even notice)? Doesn't it get handed off after role::spare is put on the box, somewhere around "handoff for service imple... [20:16:43] twentyafterfour: looks like we are missing ::keyholder::agent on the backends [20:17:05] yeah [20:17:20] hieradata/role/common/deployment_server.yaml [20:17:30] (03CR) 10BBlack: [C: 03+1] "+1 ok for traffic geoip data updates" [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: 10Jbond) [20:17:37] (03PS4) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [20:17:44] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: WikimediaEditorTasks: Enable edit streaks on beta (duration: 01m 03s) [20:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:42] !log arlolra@deploy1001 Started deploy [parsoid/deploy@a69ec92]: Updating Parsoid to 5ac1623 [20:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:15] (03PS1) 10Dzahn: deployment: add deploy-design trusted group stanza [puppet] - 
10https://gerrit.wikimedia.org/r/547296 (https://phabricator.wikimedia.org/T235677) [20:21:26] (03PS1) 1020after4: add deploy-design to keyholder agents [puppet] - 10https://gerrit.wikimedia.org/r/547297 (https://phabricator.wikimedia.org/T235677) [20:21:30] twentyafterfour: ^ but needs more in scap::sources ? [20:21:56] yeah that basically [20:22:22] (03CR) 10Dzahn: [C: 03+2] deployment: add deploy-design trusted group stanza [puppet] - 10https://gerrit.wikimedia.org/r/547296 (https://phabricator.wikimedia.org/T235677) (owner: 10Dzahn) [20:23:42] Keyholder::Server/Keyholder::Agent[deploy-design]/File[/etc/keyholder-auth.d/deploy_design.yml]/ensure: defined content as '{md5}93b930de413782e1ed563f824caf6ab7' [20:23:45] twentyafterfour: ^ go [20:24:33] Reedy: Sorry, no, I wasn't, but you should C-1. ;-) [20:24:48] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:00] permission denied [20:25:06] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 18s) [20:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:18] twentyafterfour: keyholder arm ? 
[20:25:43] I think yeah [20:25:47] keyholder status doesn't show it [20:26:28] (03CR) 10Ottomata: [C: 03+2] Update deployment-eventgate host for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547291 (owner: 10Ottomata) [20:26:44] also don't see it in /etc/keyholder-auth.d [20:27:18] twentyafterfour: somebody added keys without using the same passphrase [20:27:30] eww [20:27:41] yea, eww [20:28:28] apache2secmod specifically [20:28:52] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@a69ec92]: Updating Parsoid to 5ac1623 (duration: 09m 10s) [20:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:04] (03Abandoned) 1020after4: add deploy-design to keyholder agents [puppet] - 10https://gerrit.wikimedia.org/r/547297 (https://phabricator.wikimedia.org/T235677) (owner: 1020after4) [20:29:43] !log otto@deploy1001 Synchronized wmf-config/LabsServices.php: Syncing LabsServices.php change for beta eventgate instance replacement (duration: 01m 01s) [20:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:07] 10Operations, 10Puppet, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10colewhite) [20:31:16] twentyafterfour: it is not listed on https://wikitech.wikimedia.org/wiki/Keyholder so that prevents us from arming it [20:31:19] 10Operations, 10Puppet, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10colewhite) p:05Triage→03Low [20:33:41] hmm keyholder should probably let you arm some keys without all of them [20:33:47] todo ^ [20:34:25] (03PS1) 1020after4: Add design/style-guide to scap::sources on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/547300 (https://phabricator.wikimedia.org/T235677) [20:35:22] it really shouldn't be this difficult to add a new scap deployment :-/ [20:40:49] twentyafterfour: ah, it has "add" [20:41:42] (03PS5) 10Bstorm: labsaliaser: remove the cron entry and add 
monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [20:41:49] twentyafterfour: Identity added: /etc/keyholder.d/deploy_design (root@puppetmaster1001) [20:42:16] that comment should not be root@puppetmaster but that's just the comment [20:42:45] !log Updated Parsoid to 5ac1623 (T235656, T233818, T234549, T227209, T236112) [20:42:49] (03PS6) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [20:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:57] T236112: Missing contentmodel handlers for everything but wikitext - https://phabricator.wikimedia.org/T236112 [20:42:58] T234549: "Properly" address missing srcText issues in PageConfigFrame - https://phabricator.wikimedia.org/T234549 [20:42:58] T227209: Security Review For Parsoid-PHP - https://phabricator.wikimedia.org/T227209 [20:42:58] T233818: Call to a member function getContent() on null - https://phabricator.wikimedia.org/T233818 [20:42:59] T235656: Ref fragments remain unexpanded in Image:Frameless and mw:ExpandedAttrs nodes - https://phabricator.wikimedia.org/T235656 [20:44:33] (03CR) 10Dzahn: [C: 03+2] Add design/style-guide to scap::sources on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/547300 (https://phabricator.wikimedia.org/T235677) (owner: 1020after4) [20:46:03] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:07] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 04s) [20:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:12] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@4d8d085]: (no justification provided) [20:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:47:15] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@4d8d085]: (no justification provided) (duration: 00m 03s) [20:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:50] sign_and_send_pubkey: signing failed: agent refused operation [20:47:52] Permission denied (publickey,keyboard-interactive). [20:48:21] (03PS1) 10Ottomata: Add eventgate-logging-external instance in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/547307 (https://phabricator.wikimedia.org/T236386) [21:02:23] (03CR) 10BryanDavis: labsaliaser: remove the cron entry and add monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/547287 (owner: 10Bstorm) [21:03:13] (03PS1) 10RLazarus: Add rlazarus Yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/547310 [21:06:54] (03CR) 10CDanis: [C: 03+1] Add rlazarus Yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/547310 (owner: 10RLazarus) [21:06:58] (03CR) 10RLazarus: [C: 03+2] Add rlazarus Yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/547310 (owner: 10RLazarus) [21:09:35] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) Eventually a few days after I noticed jobs were slower than expected and spend a couple days narrowing down. Tha... [21:10:50] (03CR) 10Jforrester: [C: 04-1] "Needs vendor patches first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/546975 (https://phabricator.wikimedia.org/T227242) (owner: 10Reedy) [21:16:59] (03PS1) 10Hashar: contint: downgrade docker on Stretch to match Jessie [puppet] - 10https://gerrit.wikimedia.org/r/547313 (https://phabricator.wikimedia.org/T236675) [21:17:36] !log ppchelko@deploy1001 Started deploy [restbase/deploy@88cf547]: Parsoid mirroring followups: T236837, T236838 [21:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:43] T236837: High rate of 412 responses from Parsoid-PHP - https://phabricator.wikimedia.org/T236837 [21:17:43] T236838: RESTBase mirror mode for Parsoid-PHP doesn't honor storage - https://phabricator.wikimedia.org/T236838 [21:21:10] 10Operations, 10Parsoid-PHP, 10serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) Okay, I have the full set of about 110 urls (that were failing with Parsoid/PHP with OOMs in the last 24 hours) and am ready to test these urls on scandium. @mutante, can you bum... [21:22:39] 10Operations, 10Parsoid-PHP, 10serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) Some of those urls were being repeatedly retried (many 10s of times over the last 24 hours probably because of T236838) .. so, that is why only ~100 unique urls even though there... [21:31:40] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@88cf547]: Parsoid mirroring followups: T236837, T236838 (duration: 14m 04s) [21:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:46] T236837: High rate of 412 responses from Parsoid-PHP - https://phabricator.wikimedia.org/T236837 [21:31:46] T236838: RESTBase mirror mode for Parsoid-PHP doesn't honor storage - https://phabricator.wikimedia.org/T236838 [21:34:14] I'm no expert but T236955 's 255.255.* IPs looks like non-public ones to me, like DNS? 
[21:34:15] T236955: Lift IP cap for edit-a-thon at Bard College Nov. 11, 2019 - https://phabricator.wikimedia.org/T236955 [21:41:08] hauskater: pretty sure they're netmasks associated with the IP prefix :) [21:43:31] (03PS1) 10RLazarus: Delete rlazarus non-Yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/547319 [21:46:39] (03CR) 10Hashar: "Cherry picked it on the integration puppet master. I have manually downgraded the whole fleet." [puppet] - 10https://gerrit.wikimedia.org/r/547313 (https://phabricator.wikimedia.org/T236675) (owner: 10Hashar) [21:48:10] (03CR) 10RLazarus: [C: 03+2] Delete rlazarus non-Yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/547319 (owner: 10RLazarus) [21:55:05] !log ppchelko@deploy1001 Started deploy [restbase/deploy@fa934c8]: Bump parsoid mirroring to 25% and fix 412: T235902, T236837 [21:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:11] T235902: Tracking: Shadow Parsoid/PHP deployment to production cluster to handle mirrored reparse traffic - https://phabricator.wikimedia.org/T235902 [21:55:11] T236837: High rate of 412 responses from Parsoid-PHP - https://phabricator.wikimedia.org/T236837 [22:08:59] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@fa934c8]: Bump parsoid mirroring to 25% and fix 412: T235902, T236837 (duration: 13m 54s) [22:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:05] T235902: Tracking: Shadow Parsoid/PHP deployment to production cluster to handle mirrored reparse traffic - https://phabricator.wikimedia.org/T235902 [22:09:06] T236837: High rate of 412 responses from Parsoid-PHP - https://phabricator.wikimedia.org/T236837 [22:13:41] 10Operations, 10ops-esams, 10decommission, 10Patch-For-Review: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883 (10Papaul) [22:14:08] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 
(10Papaul) [22:14:12] 10Operations, 10ops-esams, 10decommission, 10Patch-For-Review: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883 (10Papaul) 05Open→03Resolved a:03Papaul Complete [22:17:42] 10Operations, 10ops-esams: cp3038, cp3039 - power supply redundancy failure - https://phabricator.wikimedia.org/T203272 (10Papaul) 05Open→03Resolved a:03Papaul These servers are unracked and recycled, we can resolve this task [22:18:39] 10Operations, 10ops-esams, 10DC-Ops: Check power supply balance settings on cp3030+ - https://phabricator.wikimedia.org/T98984 (10Papaul) 05Open→03Resolved a:03Papaul This server is decommissioned, we can resolve this task [22:19:07] !log andrew@deploy1001 Started deploy [horizon/deploy@2d551d8]: Rolling out a currently-turned-off puppet edit mode [22:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:13] (03CR) 10Jforrester: "Remember that each dblist's existence slows down every page view and we're trying to discourage them (until my magic YAML files replaces t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [22:22:22] !log andrew@deploy1001 Finished deploy [horizon/deploy@2d551d8]: Rolling out a currently-turned-off puppet edit mode (duration: 03m 15s) [22:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:26] (03PS22) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [22:22:28] (03PS30) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [22:22:30] (03PS28) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [22:22:32] 
(03PS26) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [22:22:34] (03PS27) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [22:22:36] (03PS27) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [22:29:47] !log scandium - live hack /srv/mediawiki/wmf-config/InitialiseSettings.php - set wmgMemoryLimit to 850 (*1024 *1024), restart php7.2-fpm (T236833) [22:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:52] T236833: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 [22:30:42] subbu: ^ (i did not see the ping earlier on the ticket because different phab usernames. please use dzahn there, i need to get rid of the other account) [22:30:49] 10Operations, 10ops-esams: cr3-esams:et-1/0/0 flap - https://phabricator.wikimedia.org/T236767 (10Papaul) @ayounsi we had this same problem @faidon and I on cr3-esams:et-1/0/1 and asw2-esams:et-6/0/51 . it ended up being the optic on the router side. I think that cr3-esams:et-1/0/0 is using the same type of... 
[22:30:54] i changed the memory limit to 850 on scandium [22:31:30] by changing MW config like i did when tested the other day [22:32:21] (03CR) 10Bstorm: labsaliaser: remove the cron entry and add monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/547287 (owner: 10Bstorm) [22:33:17] 10Operations, 10ops-esams, 10DC-Ops: Setup management switch in OE12 - https://phabricator.wikimedia.org/T84700 (10Papaul) 05Open→03Declined We don't know this anymore [22:34:43] 10Operations, 10Parsoid-PHP, 10serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Dzahn) > @Mutante, can you bump the limit to 850 MB on scandium? @ssastry I changed it to 850MB on scandium in MediawikiConfig as logged above. (Please use the @Dzahn user here on Ph... [22:34:51] (03PS1) 10Jhedden: bootstrapvz: remove ed25519 ssh host keys after build [puppet] - 10https://gerrit.wikimedia.org/r/547333 [22:35:52] (03CR) 10Jhedden: "more info on commands globbing syntax" [puppet] - 10https://gerrit.wikimedia.org/r/547333 (owner: 10Jhedden) [22:36:39] mutante, thanks .. yes, will use dzahn in future there . [22:38:09] subbu: so yea, when i tested for that little table i pasted, i always just changed the limit there in MW config (not php-fpm) and i could see the different results [22:38:44] ok .. will test later. thanks. 
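For reference, the "850 (*1024 *1024)" in the log entry above is a limit in MiB multiplied out to bytes. A minimal sketch of that arithmetic follows. The helper names `limit_in_bytes` and `php_shorthand` are hypothetical and only illustrate the conversion; they are not part of MediaWiki or its configuration.

```python
# Hypothetical helpers, not MediaWiki code: they only illustrate the
# MiB-to-bytes arithmetic implied by "850 (*1024 *1024)" in the log entry.
MIB = 1024 * 1024  # bytes per mebibyte


def limit_in_bytes(limit_mib: int) -> int:
    """Convert a memory limit expressed in MiB to bytes."""
    return limit_mib * MIB


def php_shorthand(limit_mib: int) -> str:
    """Format the same limit in PHP's shorthand byte notation, e.g. '850M'."""
    return f"{limit_mib}M"


if __name__ == "__main__":
    for mib in (500, 850):
        print(php_shorthand(mib), "=", limit_in_bytes(mib), "bytes")
```

The 500 value is the php-fpm `memory_limit` mentioned earlier in the log. 850 is the value set on scandium for testing.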
[22:38:59] yw, ack [22:40:45] (03CR) 10Gergő Tisza: "> Remember that each dblist's existence slows down every page view and we're trying to discourage them (until my magic YAML files replaces" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [22:41:01] (03PS7) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [22:41:45] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [22:43:02] (03CR) 10Bstorm: labsaliaser: remove the cron entry and add monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547287 (owner: 10Bstorm) [22:44:26] (03PS8) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [22:51:51] (03PS1) 10Papaul: DNS: Remove mgmt Dns for ms-be300[1-4] and ms-fe300[1-2] [dns] - 10https://gerrit.wikimedia.org/r/547337 [22:51:54] (03CR) 10Gergő Tisza: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [22:52:03] (03CR) 10Jhedden: "we may want to use the upstream fix instead: https://github.com/andsens/bootstrap-vz/commit/fe0f8eba5bd30de9c6d6693dfab8f692b1978015" [puppet] - 10https://gerrit.wikimedia.org/r/547333 (owner: 10Jhedden) [22:55:51] (03CR) 10Alex Monk: "Upstream removes some stuff for DSA keys, don't think we've got rid of those in labs yet." [puppet] - 10https://gerrit.wikimedia.org/r/547333 (owner: 10Jhedden) [22:56:12] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt Dns for ms-be300[1-4] and ms-fe300[1-2] [dns] - 10https://gerrit.wikimedia.org/r/547337 (owner: 10Papaul) [22:56:35] (03PS1) 10Isaac Johnson: Undeploy reader surveys in English, Polish, and Russian. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/547339 (https://phabricator.wikimedia.org/T232525) [22:58:01] 10Operations, 10ops-esams, 10decommission, 10Patch-For-Review: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518 (10Papaul) [22:58:03] (03CR) 10Mathew.onipe: query_service: rename wdqs module to query_service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [22:58:26] 10Operations, 10ops-esams, 10decommission, 10Patch-For-Review: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518 (10Papaul) 05Open→03Resolved Complete [22:58:28] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Papaul) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191030T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:06:11] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Papaul) [23:08:43] !log power cycle cr3-esams re1 - T236598 [23:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:48] T236598: cr3-esams crash - https://phabricator.wikimedia.org/T236598 [23:14:35] RECOVERY - Juniper alarms on cr3-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [23:17:30] 10Operations, 10netops: cr3-esams crash - https://phabricator.wikimedia.org/T236598 (10ayounsi) 05Open→03Resolved Power cycled CB1 (hosting re1) following https://kb.juniper.net/InfoCenter/index?page=content&id=KB14278&cat=JUNOS&actp=LIST and RE1 is now back online in a healthy state. 
[23:19:41] (03PS1) 10Papaul: DNS: Remove mgmt DNS for amslvs[1-3] [dns] - 10https://gerrit.wikimedia.org/r/547347 [23:20:27] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:26:16] (03CR) 10BryanDavis: labsaliaser: remove the cron entry and add monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547287 (owner: 10Bstorm) [23:26:41] (03PS1) 10Dzahn: parsoid-php: on beta, add sudo privs for php-fpm restarts [puppet] - 10https://gerrit.wikimedia.org/r/547349 (https://phabricator.wikimedia.org/T236275) [23:31:19] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr3-esams.wikimedia.org recovered from Juniper alarm active [23:31:37] RECOVERY - Check the Netbox report management for fail status. on netbox1001 is OK: management.ManagementConsole OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:31:53] (03PS9) 10Bstorm: labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 [23:32:33] (03CR) 10Bstorm: labsaliaser: remove the cron entry and add monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547287 (owner: 10Bstorm) [23:33:45] 10Operations, 10ops-codfw, 10DC-Ops: furud: disconnect and power down all disk shelves - https://phabricator.wikimedia.org/T199251 (10faidon) 05Open→03Resolved [23:35:01] (03CR) 10Bstorm: [C: 03+2] labsaliaser: remove the cron entry and add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/547287 (owner: 10Bstorm) [23:42:51] RECOVERY - Check the Netbox report librenms for fail status. 
on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:50:02] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for amslvs[1-3] [dns] - 10https://gerrit.wikimedia.org/r/547347 (owner: 10Papaul) [23:56:16] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/19180/" [puppet] - 10https://gerrit.wikimedia.org/r/547349 (https://phabricator.wikimedia.org/T236275) (owner: 10Dzahn)