[00:03:53] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:17] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37852680 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:05] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 54352 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:04:11] (PS1) DannyS712: Remove "Create a book" link on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683)
[03:08:36] (PS2) DannyS712: Remove "Create a book" link on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683)
[03:17:58] (CR) Pppery: Remove "Create a book" link on enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683) (owner: DannyS712)
[04:12:17] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 53990080 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:13:56] (PS1) BryanDavis: toolforge: Monitor local crontabs with Prometheus [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993)
[04:14:19] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 192835096 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:15:53] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 64320912 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:15:54] (CR) jerkins-bot: [V: -1] toolforge: Monitor local crontabs with Prometheus [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[04:16:11] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 51299952 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:16:37] (CR) Andrew Bogott: [C: +2] toolforge: replace diamond redis monitoring with prometheus [puppet] - https://gerrit.wikimedia.org/r/561379 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[04:17:43] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18842976 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:17:55] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 51120 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:17:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 35864 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:18:53] (PS2) BryanDavis: toolforge: Monitor local crontabs with Prometheus [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993)
[04:19:31] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11120 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:19:31] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 27168 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:20:35] Operations, observability, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Deprecate
Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[04:21:09] (CR) jerkins-bot: [V: -1] toolforge: Monitor local crontabs with Prometheus [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[04:25:23] (CR) BryanDavis: "Jenkins error is "modules/toollabs/manifests/init.pp:164 wmf-style: class 'toollabs' declares class profile::prometheus::node_local_cronta" [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[04:28:33] Operations, observability, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[04:39:32] (CR) Tim Starling: [C: +1] "Looks good, approving for self-merge and deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) (owner: Krinkle)
[04:51:03] (PS1) Andrew Bogott: Revert "toolforge: replace diamond redis monitoring with prometheus" [puppet] - https://gerrit.wikimedia.org/r/561423
[04:51:27] Operations, observability, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[04:55:41] (CR) Andrew Bogott: [C: +2] Revert "toolforge: replace diamond redis monitoring with prometheus" [puppet] - https://gerrit.wikimedia.org/r/561423 (owner: Andrew Bogott)
[05:17:34] (CR) BryanDavis: "Reverted in Ie776a16125f0302a9ab58d379bdf0177643fca68 because I didn't realize that this also needs Prometheus server config changes which" [puppet] - https://gerrit.wikimedia.org/r/561379 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[05:56:56] (PS1) BryanDavis: toolforge: replace diamond redis monitoring with prometheus [puppet] -
https://gerrit.wikimedia.org/r/561437 (https://phabricator.wikimedia.org/T210993)
[05:58:56] (PS2) BryanDavis: toolforge: replace diamond redis monitoring with prometheus [puppet] - https://gerrit.wikimedia.org/r/561437 (https://phabricator.wikimedia.org/T210993)
[06:00:34] (CR) BryanDavis: "phamhi, godog: this is largely a guess by me about how to work around the lack of PuppetDB to do magic wiring of the jobs. Please review a" [puppet] - https://gerrit.wikimedia.org/r/561437 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[06:20:47] (PS1) Marostegui: mariadb: Depool labsdb1009 [puppet] - https://gerrit.wikimedia.org/r/561440
[06:21:36] (CR) Marostegui: [C: +2] mariadb: Depool labsdb1009 [puppet] - https://gerrit.wikimedia.org/r/561440 (owner: Marostegui)
[06:22:40] !log Depool labsdb1009
[06:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2087:3316 - T239453', diff saved to https://phabricator.wikimedia.org/P10020 and previous config saved to /var/cache/conftool/dbconfig/20200102-062650-marostegui.json
[06:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:54] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[06:29:14] !log Remove revision partitions from db2087:3316 T239453
[06:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:03] !log Upgrade labsdb1009
[06:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:07] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[06:34:12] (PS1) Marostegui: Revert "mariadb: Depool labsdb1009" [puppet] - https://gerrit.wikimedia.org/r/561442
[06:35:01] (CR) Marostegui: [C: +2] Revert "mariadb:
Depool labsdb1009" [puppet] - https://gerrit.wikimedia.org/r/561442 (owner: Marostegui)
[06:44:50] !log Repool labsdb1009
[06:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:48] !log Deploy schema change on db2131 - T241387
[06:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:51] T241387: Extend flow_wiki_ref.ref_src_wiki - https://phabricator.wikimedia.org/T241387
[06:57:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:01:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:18:37] !log Deploy schema change on labswiki.flow_wiki_ref (empty table) T241387
[07:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:41] T241387: Extend flow_wiki_ref.ref_src_wiki - https://phabricator.wikimedia.org/T241387
[07:26:45] !log Upgrade db2079
[07:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:03] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) @Andrew @Bstorm who in WMCS would be responsible for restarting mysql on these hosts? ` labservices1001 labservices1002 labtestserv...
[07:28:56] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:49:19] !log Deploy schema change on techconductwiki.flow_wiki_ref (empty table) on s3 master (db1123) T241387
[07:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:23] T241387: Extend flow_wiki_ref.ref_src_wiki - https://phabricator.wikimedia.org/T241387
[07:55:51] (PS1) Elukey: graphite::wmcs::archiver: fix more errors in archive-instances [puppet] - https://gerrit.wikimedia.org/r/561453
[07:58:01] (PS1) Muehlenhoff: Offboard Mathew [puppet] - https://gerrit.wikimedia.org/r/561490
[08:04:33] (CR) Muehlenhoff: [C: +2] Offboard Mathew [puppet] - https://gerrit.wikimedia.org/r/561490 (owner: Muehlenhoff)
[08:07:15] (PS1) Muehlenhoff: Remove Mathew from Icinga config [puppet] - https://gerrit.wikimedia.org/r/561564
[08:09:57] (CR) Muehlenhoff: [C: +2] Remove Mathew from Icinga config [puppet] - https://gerrit.wikimedia.org/r/561564 (owner: Muehlenhoff)
[08:10:21] !log Deploy schema change on officewiki.flow_wiki_ref on s3 master (db1123) T241387
[08:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:23] T241387: Extend flow_wiki_ref.ref_src_wiki - https://phabricator.wikimedia.org/T241387
[08:15:12] (CR) Andrew Bogott: [C: +2] graphite::wmcs::archiver: fix more errors in archive-instances [puppet] - https://gerrit.wikimedia.org/r/561453 (owner: Elukey)
[08:22:59] (PS2) Ema: Revert "vcl: rewrite cache busting Main_Page tests" [puppet] - https://gerrit.wikimedia.org/r/561157
[08:26:39] !log Upgrade db2075
[08:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:03] (CR) Ema: [C: +2] mtail: port varnishbackendtiming to ATS [puppet] - https://gerrit.wikimedia.org/r/561266 (https://phabricator.wikimedia.org/T233474)
(owner: Ema)
[08:28:06] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:31:18] (PS1) Ammarpad: Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/561572 (https://phabricator.wikimedia.org/T241694)
[08:33:47] (PS1) Muehlenhoff: Removed LDAP access for pdrouin [puppet] - https://gerrit.wikimedia.org/r/561573
[08:35:14] (PS2) Ammarpad: Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/561572 (https://phabricator.wikimedia.org/T241694)
[08:35:17] !log Upgrade db2090
[08:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:15] (CR) Muehlenhoff: [C: +2] Removed LDAP access for pdrouin [puppet] - https://gerrit.wikimedia.org/r/561573 (owner: Muehlenhoff)
[08:38:35] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:44:49] (PS1) Ammarpad: Add throttle exception for Amical Wikimedia Workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/561576 (https://phabricator.wikimedia.org/T241705)
[08:45:55] (CR) jerkins-bot: [V: -1] Add throttle exception for Amical Wikimedia Workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/561576 (https://phabricator.wikimedia.org/T241705) (owner: Ammarpad)
[08:54:07] (PS3) Muehlenhoff: Offboard Tim Eulitz [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[08:54:21] Operations, Performance-Team, Traffic, observability, Patch-For-Review: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (ema) >>! In T233474#5761873, @Krinkle wrote: > It looks like the Apache Backend-Timing graphs dried up. >...
[08:55:08] (CR) Muehlenhoff: "Thanks for the patch, much appreciated! I'll merge with a small followup change" [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[08:56:34] (PS4) Muehlenhoff: Offboard Tim Eulitz [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[08:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2076 T241647', diff saved to https://phabricator.wikimedia.org/P10021 and previous config saved to /var/cache/conftool/dbconfig/20200102-085806-marostegui.json
[08:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:11] T241647: Upgrade BIOS and firmware on db2076 - https://phabricator.wikimedia.org/T241647
[08:58:39] (CR) Muehlenhoff: [C: +2] Offboard Tim Eulitz [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[08:59:31] Operations, ops-codfw, DBA: Upgrade BIOS and firmware on db2076 - https://phabricator.wikimedia.org/T241647 (Marostegui) This host is depooled and ready to get, downtimed MySQL stopped and powered off for @Papaul to proceed.
[09:01:41] (PS2) Ammarpad: Add throttle exception for Amical Wikimedia Workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/561576 (https://phabricator.wikimedia.org/T241705)
[09:02:39] (CR) Muehlenhoff: "I also removed him from the cn=wmde and cn=nda LDAP groups and created https://phabricator.wikimedia.org/T241713 for HDFS" [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[09:02:45] (CR) jerkins-bot: [V: -1] Add throttle exception for Amical Wikimedia Workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/561576 (https://phabricator.wikimedia.org/T241705) (owner: Ammarpad)
[09:06:22] Operations, netops: fastnetmon misreports attack type and protocol - https://phabricator.wikimedia.org/T241374 (ayounsi) a:ayounsi Opened https://github.com/pavel-odintsov/fastnetmon/issues/787
[09:09:17] Operations, Performance-Team, SRE-swift-storage, Traffic, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (fgiunchedi) Analytics now publishes media access stats, might be useful to drive some/all thumbnail cleanup: https://wikitech.wiki...
[09:11:33] (PS1) Ema: Rename cloud_nets to public_cloud_nets [labs/private] - https://gerrit.wikimedia.org/r/561580
[09:12:12] (CR) Ema: [C: +2] Revert "vcl: rewrite cache busting Main_Page tests" [puppet] - https://gerrit.wikimedia.org/r/561157 (owner: Ema)
[09:12:30] (CR) Ema: [V: +2 C: +2] Rename cloud_nets to public_cloud_nets [labs/private] - https://gerrit.wikimedia.org/r/561580 (owner: Ema)
[09:13:53] (CR) Filippo Giunchedi: [C: +2] install_server: deprecate raid10-gpt.cfg [puppet] - https://gerrit.wikimedia.org/r/559550 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:14:05] (PS4) Filippo Giunchedi: install_server: deprecate raid10-gpt.cfg [puppet] - https://gerrit.wikimedia.org/r/559550 (https://phabricator.wikimedia.org/T156955)
[09:16:33] (PS2) Ema: vcl: stricter rate limiting of cloud IPs [puppet] - https://gerrit.wikimedia.org/r/561156
[09:17:09] (PS3) Ema: vcl: stricter rate limiting of cloud IPs [puppet] - https://gerrit.wikimedia.org/r/561156
[09:18:31] (CR) Filippo Giunchedi: [C: +2] DHCP: Change MAC address for ms-fe2007 [puppet] - https://gerrit.wikimedia.org/r/560409 (https://phabricator.wikimedia.org/T239805) (owner: Papaul)
[09:23:08] Operations, ops-codfw, Patch-For-Review: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` ms-fe2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202001020923...
[09:25:52] Operations, Traffic, observability: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (Volans) @ema maybe could be related to NUMA utilization? Having a quick look at `numastat` (both `-n` and `-m`) there is a general imbalance between the t...
[09:32:47] Operations, ops-codfw, Patch-For-Review: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (fgiunchedi) @Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from bios: ` Link Status * `
[09:33:37] Operations, ops-codfw, Patch-For-Review: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-fe2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-fe2007.codfw.wmnet'] `
[09:33:41] (PS1) Elukey: role::analytics_cluster::coordinator: deploy presto's keytab [puppet] - https://gerrit.wikimedia.org/r/561584
[09:34:01] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: deploy presto's keytab [puppet] - https://gerrit.wikimedia.org/r/561584 (owner: Elukey)
[09:36:03] Operations, ops-codfw, Patch-For-Review: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (fgiunchedi) >>! In T239805#5770424, @fgiunchedi wrote: > @Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from bios: > > ` > Link Status...
[09:36:13] (CR) Ayounsi: netbox: skip virtual chassis without domain (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/559973 (owner: Volans)
[09:39:18] (CR) Volans: netbox: skip virtual chassis without domain (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/559973 (owner: Volans)
[09:42:42] (CR) Ayounsi: [C: +1] "Tested and works as expected." (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/559973 (owner: Volans)
[09:46:43] Operations, ops-codfw, Patch-For-Review: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (fgiunchedi) Configured both ports to use PXE when booting, now the host is running the reimage correctly: ` NIC in Slot 2 Port 1: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F...
[09:54:52] (PS1) Vgutierrez: Provide MX and SPF records for wikimedia.community [dns] - https://gerrit.wikimedia.org/r/561587 (https://phabricator.wikimedia.org/T241132)
[09:55:15] (CR) jerkins-bot: [V: -1] Provide MX and SPF records for wikimedia.community [dns] - https://gerrit.wikimedia.org/r/561587 (https://phabricator.wikimedia.org/T241132) (owner: Vgutierrez)
[09:56:49] (PS2) Vgutierrez: Provide MX and SPF records for wikimedia.community [dns] - https://gerrit.wikimedia.org/r/561587 (https://phabricator.wikimedia.org/T241132)
[09:57:14] (PS1) Elukey: role::analytics_cluster::presto::server: add kerberos config [puppet] - https://gerrit.wikimedia.org/r/561590
[09:58:44] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:58:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:06] (PS4) Volans: eqsin: add missing mgmt records [dns] - https://gerrit.wikimedia.org/r/554080 (https://phabricator.wikimedia.org/T239597)
[10:02:03] (PS2) Elukey: role::analytics_cluster::presto::server: add kerberos config [puppet] - https://gerrit.wikimedia.org/r/561590
[10:04:00] (PS1) Elukey: Add fake kerberos keytabs for presto nodes [labs/private] - https://gerrit.wikimedia.org/r/561591
[10:04:16] (CR) Elukey: [V: +2 C: +2] Add fake kerberos keytabs for presto nodes [labs/private] - https://gerrit.wikimedia.org/r/561591 (owner: Elukey)
[10:06:15] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[10:06:54] (PS1) Muehlenhoff: Switch snapshot100[89] to standard
recipe [puppet] - https://gerrit.wikimedia.org/r/561593 (https://phabricator.wikimedia.org/T156955)
[10:07:05] Operations, ops-codfw, SRE-swift-storage: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241534 (fgiunchedi)
[10:07:07] Operations, ops-codfw, SRE-swift-storage: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241535 (fgiunchedi)
[10:08:00] ACKNOWLEDGEMENT - MD RAID on ms-be2035 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T241714 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:08:03] Operations, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241714 (ops-monitoring-bot)
[10:08:35] (PS3) Elukey: role::analytics_cluster::presto::server: add kerberos config [puppet] - https://gerrit.wikimedia.org/r/561590
[10:08:37] Operations, ops-codfw, SRE-swift-storage: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241534 (fgiunchedi) @papaul host is in warranty and looks like an SSD failed, could we get that replaced (led is blinking), thanks!
[10:09:52] Operations, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241714 (fgiunchedi)
[10:09:55] Operations, ops-codfw, SRE-swift-storage: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241534 (fgiunchedi)
[10:09:57] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31073480 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:10:49] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[10:11:21] (CR) Elukey: [C: +2] role::analytics_cluster::presto::server: add kerberos config [puppet] - https://gerrit.wikimedia.org/r/561590 (owner: Elukey)
[10:11:31] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 67624 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:17:05] (PS2) Volans: netbox: skip virtual chassis without domain [software/homer] - https://gerrit.wikimedia.org/r/559973
[10:21:09] (CR) Volans: [C: +2] netbox: skip virtual chassis without domain [software/homer] - https://gerrit.wikimedia.org/r/559973 (owner: Volans)
[10:21:41] (CR) Ema: [C: +1] "I did review the change by using `git show` locally on my machine given that (likely due to a bug) gerrit does not show the 75 added lines" [dns] - https://gerrit.wikimedia.org/r/561587 (https://phabricator.wikimedia.org/T241132) (owner: Vgutierrez)
[10:24:32] (Merged) jenkins-bot: netbox: skip virtual chassis without domain [software/homer] - https://gerrit.wikimedia.org/r/559973 (owner: Volans)
[10:33:54] (CR) Ammarpad: [C: +1] "Looks good." [mediawiki-config] - https://gerrit.wikimedia.org/r/561286 (https://phabricator.wikimedia.org/T241304) (owner: Majavah)
[10:35:31] Operations, SRE-swift-storage, serviceops, Patch-For-Review, and 2 others: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (fgiunchedi) p:High→Normal >>! In T226373#5762068, @jcrespo wrote: > What is the right followup after...
[10:37:13] (CR) Filippo Giunchedi: [C: +1] Switch snapshot100[89] to standard recipe [puppet] - https://gerrit.wikimedia.org/r/561593 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[10:38:21] Operations, ops-codfw, User-fgiunchedi: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (fgiunchedi)
[10:40:54] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[10:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:01] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:31] (PS1) Arturo Borrero Gonzalez: aptrepo: thirdparty/openstack-pike-stretch: add python-novaclient [puppet] - https://gerrit.wikimedia.org/r/561596 (https://phabricator.wikimedia.org/T241347)
[10:52:02] (PS2) Muehlenhoff: Readd the late-install hack until WMCS switched to Puppet 5 / Facter 3 as well [puppet] - https://gerrit.wikimedia.org/r/559509 (https://phabricator.wikimedia.org/T239832)
[10:52:26] Operations, cloud-services-team: Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (MoritzMuehlenhoff)
[10:52:50] (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: thirdparty/openstack-pike-stretch: add python-novaclient [puppet] - https://gerrit.wikimedia.org/r/561596 (https://phabricator.wikimedia.org/T241347) (owner: Arturo Borrero Gonzalez)
[10:58:46] Operations, Traffic, observability: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (fgiunchedi) I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was report...
[11:06:29] (CR) Muehlenhoff: [C: +2] Readd the late-install hack until WMCS switched to Puppet 5 / Facter 3 as well [puppet] - https://gerrit.wikimedia.org/r/559509 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff)
[11:10:42] Operations, ops-codfw, User-fgiunchedi: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (fgiunchedi) Open→Resolved Host is back in service!
[11:18:52] Operations, ops-eqiad, SRE-swift-storage: ms-be1020 - firmware upgrade: (was: host went down) - https://phabricator.wikimedia.org/T234698 (fgiunchedi)
[11:23:09] !log import more openstack packages into stretch-wikimedia thirdparty/openstack-pike-stretch (T241347)
[11:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:13] T241347: upgrade cloud-vps openstack to Openstack version 'Pike' - https://phabricator.wikimedia.org/T241347
[11:26:14] (PS1) Muehlenhoff: Switch mc* to standard recipe [puppet] - https://gerrit.wikimedia.org/r/561599 (https://phabricator.wikimedia.org/T156955)
[11:27:08] (CR) Vgutierrez: [C: +2] Provide MX and SPF records for wikimedia.community [dns] - https://gerrit.wikimedia.org/r/561587 (https://phabricator.wikimedia.org/T241132) (owner: Vgutierrez)
[11:29:24] Operations, DNS, Mail, Traffic, Patch-For-Review: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Vgutierrez) Open→Resolved a:Vgutierrez `$ host -t mx wikimedia.community wikimedia.community mail is handled by 10 mx1001.wi...
[11:32:16] (PS1) DCausse: [cirrus] force phrase_suggest fallback profile for all beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/561600 (https://phabricator.wikimedia.org/T241487)
[11:35:33] (PS1) Volans: dns: include all IP addresses with FQDN [software/netbox-extras] - https://gerrit.wikimedia.org/r/561601 (https://phabricator.wikimedia.org/T233183)
[11:35:35] (PS1) Volans: dns: generate correct zone name in all cases [software/netbox-extras] - https://gerrit.wikimedia.org/r/561602 (https://phabricator.wikimedia.org/T233183)
[11:35:37] (PS1) Volans: dns: sort records by the rightmost part [software/netbox-extras] - https://gerrit.wikimedia.org/r/561603 (https://phabricator.wikimedia.org/T233183)
[11:36:58] (PS5) Volans: eqsin: add missing mgmt records [dns] - https://gerrit.wikimedia.org/r/554080 (https://phabricator.wikimedia.org/T239597)
[11:37:00] (PS4) Volans: eqiad: add missing mgmt records [dns] - https://gerrit.wikimedia.org/r/554081 (https://phabricator.wikimedia.org/T239597)
[11:42:18] !log reimaging mw2277 to validate fix for puppet5/facter3 installation on new installs T239832
[11:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:38] Operations, Acme-chief, Traffic: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (Vgutierrez) @Volans I think we could close this task already as everything seems healthy on acmechief1001
[11:43:09] Operations, Acme-chief, Traffic: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (Volans) Open→Resolved a:Volans Indeed, done :)
[11:43:15] T239832: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832
[11:49:02] (CR) Filippo Giunchedi: [C: +1] Switch mc* to standard recipe [puppet] - https://gerrit.wikimedia.org/r/561599
(https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[11:53:59] !log restarting FPM on scandium to clear opcache health
[11:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:02] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[11:56:44] (PS5) Filippo Giunchedi: swift: Fix icinga+prometheus+grafana alert link (Dashboard not found) [puppet] - https://gerrit.wikimedia.org/r/560538 (owner: Jcrespo)
[11:57:04] (CR) Filippo Giunchedi: [C: +1] "Updated panelid on one alert, LGTM" [puppet] - https://gerrit.wikimedia.org/r/560538 (owner: Jcrespo)
[12:00:12] (CR) Jcrespo: [C: +1] swift: Fix icinga+prometheus+grafana alert link (Dashboard not found) [puppet] - https://gerrit.wikimedia.org/r/560538 (owner: Jcrespo)
[12:00:40] (CR) Jcrespo: [C: +1] "Please deploy at your convenience." [puppet] - https://gerrit.wikimedia.org/r/560538 (owner: Jcrespo)
[12:04:07] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/561437 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[12:04:12] jynus: thanks!
will deploy now [12:04:24] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: Fix icinga+prometheus+grafana alert link (Dashboard not found) [puppet] - 10https://gerrit.wikimedia.org/r/560538 (owner: 10Jcrespo) [12:07:06] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:53] (03PS6) 10Volans: eqsin: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554080 (https://phabricator.wikimedia.org/T239597) [12:19:55] (03PS5) 10Volans: eqiad: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554081 (https://phabricator.wikimedia.org/T239597) [12:19:57] (03PS1) 10Volans: esams: add missing asset tag records [dns] - 10https://gerrit.wikimedia.org/r/561605 (https://phabricator.wikimedia.org/T239597) [12:19:59] (03PS1) 10Volans: frack: add missing asset tag records [dns] - 10https://gerrit.wikimedia.org/r/561606 (https://phabricator.wikimedia.org/T239597) [12:20:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Hardware asset tag Netbox/DNS mgmt inconsistencies - https://phabricator.wikimedia.org/T239597 (10Volans) @Jclark-ctr by any chance do you have an ETA for this task? Just to know and to plan accordingly something related. 
[12:32:18] !log upgrade recently reimaged hosts to facter 3 T239832 [12:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:22] T239832: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832 [12:41:27] !log upgrade recently reimaged hosts to puppet 5 T239832 [12:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:31] T239832: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832 [13:00:30] !log enable BFD traceoptions on cr1-eqiad and cr3-knams - T240659 [13:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:42] T240659: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 [13:11:17] 10Operations: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff New Stretch/Jessie installations are now fixed by reintroducing the late-install.sh hack. I also upgraded all recent... [13:14:49] !log scramble password for Windy906 [13:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:37] !log upgrading jessie servers to intel-microcode 3.20191115.2 [13:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:44] !log Deploy schema change on s5 codfw master (db2123) with replication - T234052 [13:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:56] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052 [13:29:19] 10Operations, 10netops: Routinator RSYNC errors - https://phabricator.wikimedia.org/T240817 (10ayounsi) 05Resolved→03Open Opened https://github.com/NLnetLabs/routinator/issues/267 upstream. 
`rsync://localhost/repo/` has been alerting for 10 days now, and there is not much we can do. [13:33:00] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Kris Litson - https://phabricator.wikimedia.org/T241722 (10Aklapper) Adding #LDAP-Access-Requests to the project tags, so someone could find this ticket. [13:36:30] 10Operations, 10Traffic: Track TLS related ATS metrics in prometheus - https://phabricator.wikimedia.org/T231286 (10Vgutierrez) 05Open→03Resolved We currently have 5 SSL/TLS related panels in the [[ https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown | ATS instance drilldown ]] [13:36:32] 10Operations, 10Traffic: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 (10Vgutierrez) [13:44:20] 10Operations, 10Elasticsearch, 10Traffic, 10Discovery-Search (Current work), and 2 others: Sustained periods (2-4h) of bad latency on production-search eqiad - https://phabricator.wikimedia.org/T241421 (10dcausse) [13:47:58] !log installing cyrus-sasl security updates on Stretch/Buster [13:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:32] 10Operations, 10Goal: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) - https://phabricator.wikimedia.org/T65899 (10Aklapper) [13:50:10] 10Operations, 10Goal: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) - https://phabricator.wikimedia.org/T65899 (10Aklapper) [14:02:10] ACKNOWLEDGEMENT - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. 
Ayounsi https://phabricator.wikimedia.org/T240817 https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:04:18] (03PS11) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 [14:04:33] (03PS3) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [14:05:39] !log restarting PHP/Apache on mw canaries to pick up SASL security update [14:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:29] (03CR) 10jerkins-bot: [V: 04-1] stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [14:06:40] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [14:06:51] (03PS2) 10Andrew Bogott: move codfw1-dev cluster to openstack 'pike' [puppet] - 10https://gerrit.wikimedia.org/r/561277 (https://phabricator.wikimedia.org/T241347) [14:08:37] (03CR) 10Andrew Bogott: [C: 03+2] move codfw1-dev cluster to openstack 'pike' [puppet] - 10https://gerrit.wikimedia.org/r/561277 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [14:13:56] (03CR) 10Ema: [C: 03+1] eqsin: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554080 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [14:22:11] !log Deploy schema change on s5 eqiad hosts - T234052 [14:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:14] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052 [14:22:59] (03PS2) 10Muehlenhoff: Add a define to install a package from a repository component (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324) [14:25:08] (03CR) 10jerkins-bot: [V: 04-1] Add a define to install a 
package from a repository component (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324) (owner: 10Muehlenhoff) [14:27:10] (03PS12) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 [14:27:34] !log Deploy schema change on s6 codfw master (db2129) with replication - T234052 [14:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:37] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052 [14:27:58] (03PS1) 10Bartosz Dziewoński: Remove 2017 wikitext editor as default on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561649 [14:29:13] (03CR) 10jerkins-bot: [V: 04-1] stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [14:31:35] (03CR) 10Volans: [C: 03+2] eqsin: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554080 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [14:31:47] (03PS13) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 [14:32:12] (03PS4) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [14:33:01] (03PS3) 10Muehlenhoff: Add a define to install a package from a repository component (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324) [14:33:53] (03CR) 10jerkins-bot: [V: 04-1] stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [14:34:26] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [14:35:59] (03PS14) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 
10https://gerrit.wikimedia.org/r/558133 [14:36:16] (03PS5) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [14:38:21] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [14:39:41] (03PS4) 10Muehlenhoff: Add a define to install a package from a repository component [puppet] - 10https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324) [14:43:02] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: drop misleading comment [puppet] - 10https://gerrit.wikimedia.org/r/561653 [14:43:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: drop misleading comment [puppet] - 10https://gerrit.wikimedia.org/r/561653 (owner: 10Arturo Borrero Gonzalez) [14:44:09] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: deploy cadvisor.yaml [puppet] - 10https://gerrit.wikimedia.org/r/561654 (https://phabricator.wikimedia.org/T237643) [14:46:13] (03PS1) 10Ema: varnish: update 01-basic-caching.vtc [puppet] - 10https://gerrit.wikimedia.org/r/561655 (https://phabricator.wikimedia.org/T241653) [14:46:21] (03PS6) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [14:46:43] 10Operations, 10Traffic, 10Patch-For-Review: two failing upload VTC tests - https://phabricator.wikimedia.org/T241653 (10ema) p:05Triage→03Normal [14:49:34] (03PS2) 10Arturo Borrero Gonzalez: toolforge: new k8s: deploy cadvisor.yaml [puppet] - 10https://gerrit.wikimedia.org/r/561654 (https://phabricator.wikimedia.org/T237643) [14:49:44] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [14:55:23] (03PS3) 10Arturo Borrero Gonzalez: toolforge: new k8s: deploy cadvisor.yaml [puppet] - 10https://gerrit.wikimedia.org/r/561654 (https://phabricator.wikimedia.org/T237643) [14:56:00] 10Operations, 
10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10herron) Removing the #SRE-Access-Requests project tag for now. Please update and re-add if/when any further action is needed. Thanks! [14:56:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/559553 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [14:58:42] (03CR) 10Vgutierrez: [C: 03+1] esams: add missing asset tag records [dns] - 10https://gerrit.wikimedia.org/r/561605 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [14:59:03] (03PS1) 10Zoranzoki21: Enable GeoData extension in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561657 (https://phabricator.wikimedia.org/T239000) [14:59:40] (03PS15) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 [14:59:56] (03PS7) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:00:00] (03PS2) 10Zoranzoki21: Enable GeoData extension in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561657 (https://phabricator.wikimedia.org/T239000) [15:00:37] (03PS1) 10Zoranzoki21: Rearrange of wmgEnableGeoData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561658 [15:00:54] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [15:01:11] (03PS4) 10Arturo Borrero Gonzalez: toolforge: new k8s: deploy cadvisor.yaml [puppet] - 10https://gerrit.wikimedia.org/r/561654 (https://phabricator.wikimedia.org/T237643) [15:02:09] (03PS1) 10Andrew Bogott: neutron: update l3_agent_hacks for Pike [puppet] - 10https://gerrit.wikimedia.org/r/561660 (https://phabricator.wikimedia.org/T241347) [15:02:34] (03PS2) 10Zoranzoki21: Rearrange of wmgEnableGeoData 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/561658 [15:02:57] (03CR) 10jerkins-bot: [V: 04-1] neutron: update l3_agent_hacks for Pike [puppet] - 10https://gerrit.wikimedia.org/r/561660 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [15:03:11] (03PS5) 10Arturo Borrero Gonzalez: toolforge: new k8s: deploy cadvisor.yaml [puppet] - 10https://gerrit.wikimedia.org/r/561654 (https://phabricator.wikimedia.org/T237643) [15:03:14] (03CR) 10jerkins-bot: [V: 04-1] Rearrange of wmgEnableGeoData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561658 (owner: 10Zoranzoki21) [15:04:50] (03PS8) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:05:40] (03PS2) 10Andrew Bogott: neutron: update l3_agent_hacks for Pike [puppet] - 10https://gerrit.wikimedia.org/r/561660 (https://phabricator.wikimedia.org/T241347) [15:05:41] (03PS9) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:05:54] (03PS10) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:06:51] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [15:08:29] (03PS2) 10Ottomata: Switch eventgate-analytics LVS to use TLS port 4192 [puppet] - 10https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) [15:08:34] (03PS11) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:08:47] (03PS12) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:10:26] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 85771600 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:11:01] (03CR) 10Ema: [C: 03+2] varnish: update 01-basic-caching.vtc 
[puppet] - 10https://gerrit.wikimedia.org/r/561655 (https://phabricator.wikimedia.org/T241653) (owner: 10Ema) [15:12:12] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 20712 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:12:14] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [15:15:47] (03PS13) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:18:11] (03PS1) 10Muehlenhoff: Add Cumin alias for netflow* hosts [puppet] - 10https://gerrit.wikimedia.org/r/561661 [15:22:29] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for netflow* hosts [puppet] - 10https://gerrit.wikimedia.org/r/561661 (owner: 10Muehlenhoff) [15:23:16] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Miriam) [15:24:33] (03PS2) 10Ottomata: Enable envoyproxy tls for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/558660 (https://phabricator.wikimedia.org/T233630) [15:25:53] (03PS1) 10Elukey: graphite::wmcs::archiver: add python-novaclient to list of deps [puppet] - 10https://gerrit.wikimedia.org/r/561665 [15:30:08] (03CR) 10Elukey: [C: 03+2] graphite::wmcs::archiver: add python-novaclient to list of deps [puppet] - 10https://gerrit.wikimedia.org/r/561665 (owner: 10Elukey) [15:30:15] (03CR) 10Jbond: "some minor comments/nits and a small optimisation recommendation" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324) (owner: 10Muehlenhoff) [15:32:55] (03PS16) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 [15:33:52] (03PS14) 10Jbond: apereo_cas: 
update to use stunnel client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [15:36:11] (03PS1) 10Elukey: hue: add row limit threshold for hive queries [puppet] - 10https://gerrit.wikimedia.org/r/561670 (https://phabricator.wikimedia.org/T241649) [15:36:58] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Kris Litson - https://phabricator.wikimedia.org/T241722 (10herron) Hello! Looping in @RStallman-legalteam to coordinate getting your NDA on file. [15:46:21] !log restarting FPM on parsoid canary to pick up SASL security update [15:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:28] 10Operations, 10Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (10herron) a:05herron→03jcrespo Sounds good @jcrespo, please pass back to me when you've received the export and uploaded it to the mailman... [15:52:10] !log restarting Apache on puppetboard* hosts to pick up SASL security update [15:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:51] (03CR) 10Jhedden: [C: 03+1] wiki replicas: Remove outdated comment about spamblacklist [puppet] - 10https://gerrit.wikimedia.org/r/561352 (https://phabricator.wikimedia.org/T241668) (owner: 10BryanDavis) [16:09:13] (03CR) 10Muehlenhoff: Add a define to install a package from a repository component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324) (owner: 10Muehlenhoff) [16:09:14] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 62424432 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:10:34] (03CR) 10Ema: [C: 03+2] vcl: stricter rate limiting of cloud IPs [puppet] - 10https://gerrit.wikimedia.org/r/561156 (owner: 10Ema) [16:11:00] RECOVERY - Postgres Replication 
Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3904 and 67 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:11:00] 10Operations, 10Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (10jcrespo) I still got no answer yet :-/. [16:11:27] !log restarting Apache on webperf* hosts to pick up SASL security update [16:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:13] (03PS4) 10Jbond: ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) [16:13:15] (03CR) 10jerkins-bot: [V: 04-1] ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [16:14:13] (03PS6) 10Jbond: CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) [16:15:18] (03PS7) 10Jbond: CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [16:15:20] (03CR) 10jerkins-bot: [V: 04-1] CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [16:15:22] (03CR) 10jerkins-bot: [V: 04-1] CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [16:15:24] (03PS13) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [16:15:26] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 
10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [16:15:46] !log restarting Apache on graphite* hosts to pick up SASL security update [16:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:11] (03PS3) 10Ottomata: Switch eventgate-analytics LVS to use TLS port 4192 [puppet] - 10https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) [16:17:32] (03PS3) 10Ottomata: Switch eventgate-main LVS to use TLS port 4292 [puppet] - 10https://gerrit.wikimedia.org/r/559168 (https://phabricator.wikimedia.org/T241073) [16:22:19] (03PS5) 10Jbond: ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) [16:22:21] (03PS7) 10Jbond: CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) [16:22:23] (03PS8) 10Jbond: CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [16:22:25] (03PS14) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [16:23:00] (03CR) 10jerkins-bot: [V: 04-1] ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [16:23:19] (03CR) 10jerkins-bot: [V: 04-1] CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [16:23:44] (03CR) 10jerkins-bot: [V: 04-1] CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [16:24:23] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black 
tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [16:35:50] PROBLEM - mediawiki-installation DSH group on mw2277 is CRITICAL: Host mw2277 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:36:53] ^ fixing [16:39:57] (03PS1) 10Elukey: profile::analytics::client::limits: add cpu limits to Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/561675 (https://phabricator.wikimedia.org/T240440) [16:40:10] (03CR) 10RobH: [C: 03+1] "Huzzah for consistency!" [dns] - 10https://gerrit.wikimedia.org/r/554081 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [16:41:52] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::client::limits: add cpu limits to Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/561675 (https://phabricator.wikimedia.org/T240440) (owner: 10Elukey) [16:43:30] (03PS15) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [16:47:07] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [16:50:29] (03PS2) 10Elukey: profile::analytics::client::limits: add cpu limits to Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/561675 (https://phabricator.wikimedia.org/T240440) [16:56:11] (03PS3) 10Elukey: profile::analytics::client::limits: add cpu limits to Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/561675 (https://phabricator.wikimedia.org/T240440) [16:58:24] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/20153/" [puppet] - 10https://gerrit.wikimedia.org/r/561675 (https://phabricator.wikimedia.org/T240440) (owner: 10Elukey) [17:00:29] 10Operations, 10ops-codfw: codfw: 
rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org - https://phabricator.wikimedia.org/T241337 (10Papaul) @Gehel you mentioned in the procurement task :"but we won't be able to use the 10G until the whole cluster is upgraded," How close are you on doing this? what... [17:03:12] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:08:07] (03CR) 10Bstorm: [C: 04-1] toolforge: new k8s: deploy cadvisor.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/561654 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [17:15:55] 10Operations, 10ops-codfw, 10DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Papaul) [17:17:12] (03PS1) 10Volans: Remove stale management records [dns] - 10https://gerrit.wikimedia.org/r/561679 (https://phabricator.wikimedia.org/T239597) [17:24:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:28:45] (03CR) 10Volans: [C: 03+2] eqiad: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554081 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [17:30:22] (03CR) 10Volans: [C: 03+2] esams: add missing asset tag records [dns] - 10https://gerrit.wikimedia.org/r/561605 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [17:36:36] RECOVERY - mediawiki-installation DSH group on mw2277 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:36:53] (03CR) 10Arturo Borrero Gonzalez: "I think I understand the changes, but just for the record, 
could you state somewhere in the commit message what is changing and for what r" [puppet] - 10https://gerrit.wikimedia.org/r/561660 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [17:39:00] (03CR) 10RobH: [C: 03+1] "Please note I spot checked this (snagging about one in every 12) to lookup and confirm. All correct!" [dns] - 10https://gerrit.wikimedia.org/r/561679 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [17:46:43] (03CR) 10Elukey: [C: 03+1] Add a note to manage_principals for added/removed Kerberos principals (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/559731 (owner: 10Muehlenhoff) [18:04:49] 10Operations, 10Analytics, 10Analytics-Kanban: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (10Milimetric) 05Open→03Resolved Good job! [18:05:09] (03CR) 10Ottomata: [C: 03+1] hue: add row limit threshold for hive queries [puppet] - 10https://gerrit.wikimedia.org/r/561670 (https://phabricator.wikimedia.org/T241649) (owner: 10Elukey) [18:08:06] (03CR) 10Dzahn: [C: 03+2] xhgui: use ensure=>present instead of ensure=>latest [puppet] - 10https://gerrit.wikimedia.org/r/560364 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [18:08:52] (03CR) 10Ammarpad: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561657 (https://phabricator.wikimedia.org/T239000) (owner: 10Zoranzoki21) [18:12:52] (03PS1) 10Dzahn: netbox: use ensure=>present instead of ensure=>latest for git cloning extras [puppet] - 10https://gerrit.wikimedia.org/r/561681 (https://phabricator.wikimedia.org/T218900) [18:13:15] hi mutante [18:18:58] (03PS1) 10Dzahn: contint: use ensure=>present when cloning slave scripts [puppet] - 10https://gerrit.wikimedia.org/r/561683 (https://phabricator.wikimedia.org/T218900) [18:19:23] (03CR) 10Elukey: [C: 03+2] hue: add row limit threshold for hive queries [puppet] - 10https://gerrit.wikimedia.org/r/561670 (https://phabricator.wikimedia.org/T241649) (owner: 10Elukey) 
[18:20:04] 10Operations, 10ops-codfw: codfw: rack/setup/install mc203[7,8,9].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [18:20:09] 10Operations, 10ops-codfw, 10Wikimedia-Logstash: rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) [18:20:37] (03PS1) 10Dzahn: contint: use ensure=>present when cloning composer [puppet] - 10https://gerrit.wikimedia.org/r/561684 (https://phabricator.wikimedia.org/T218900) [18:23:09] (03CR) 10Ammarpad: Rearrange of wmgEnableGeoData (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561658 (owner: 10Zoranzoki21) [18:25:47] (03PS1) 10Dzahn: apereo_cas: use ensure=>present when cloning overlay template [puppet] - 10https://gerrit.wikimedia.org/r/561685 (https://phabricator.wikimedia.org/T218900) [18:27:17] (03CR) 10Volans: "I'm actually working on a centralized proposal to make this workflow safe and sound. No blocker for me to change this in the meanwhile, bu" [puppet] - 10https://gerrit.wikimedia.org/r/561681 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [18:33:29] (03Abandoned) 10Dzahn: netbox: use ensure=>present instead of ensure=>latest for git cloning extras [puppet] - 10https://gerrit.wikimedia.org/r/561681 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [18:34:18] (03Abandoned) 10Dzahn: apereo_cas: use ensure=>present when cloning overlay template [puppet] - 10https://gerrit.wikimedia.org/r/561685 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [18:37:20] 10Operations, 10ops-codfw: codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) [18:43:37] 10Operations, 10Traffic, 10HTTPS, 10Voice & Tone: sec-warning page is Wikipedia-specific and dubiously worded - https://phabricator.wikimedia.org/T241656 (10Dzahn) This ticket seems to be a duplicate of T241309. 
[18:44:39] 10Operations, 10Traffic: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (10Dzahn) See also T241656 which might be a duplicate. [18:57:31] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Kris Litson - https://phabricator.wikimedia.org/T241722 (10RStallman-legalteam) @Kris_Litson_WMDE I have sent you the NDA for signature via Docusign. Let me know if you have any questions and thanks! [19:12:50] (03CR) 10Bstorm: [C: 03+1] "This looks good. Since six is already required by requests, nothing should be needed to change on the Debian package side." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/561262 (owner: 10Legoktm) [19:18:36] jouncebot: now [19:18:36] No deployments scheduled for the forseeable future! [19:18:40] jouncebot: next [19:18:40] No deployments scheduled for the forseeable future! [19:18:47] how informative :P [19:19:09] are we just running this version forever? that'll make a lot of things easier [19:27:50] (03CR) 10Bstorm: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: 10BryanDavis) [19:28:28] (03PS5) 10Bstorm: cloud: update maintain-views to handle dblists with comments [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: 10BryanDavis) [19:36:45] (03CR) 10Bstorm: [C: 03+2] cloud: update maintain-views to handle dblists with comments [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: 10BryanDavis) [19:46:10] (03PS1) 10MarcoAurelio: Temporarily add back old 'abusefilter-private(?:-log) permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561686 [19:48:06] (03PS2) 10MarcoAurelio: Temporarily add back old 'abusefilter-private(?:-log) permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561686 
(https://phabricator.wikimedia.org/T241503) [19:49:23] (03CR) 10SBassett: [C: 03+2] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561686 (https://phabricator.wikimedia.org/T241503) (owner: 10MarcoAurelio) [19:50:14] Hey all - need to deploy config change https://gerrit.wikimedia.org/r/561686 now for some minor, security-related clean-up (see above, private Phab bug T241503) [19:50:21] (03Merged) 10jenkins-bot: Temporarily add back old 'abusefilter-private(?:-log) permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561686 (https://phabricator.wikimedia.org/T241503) (owner: 10MarcoAurelio) [19:57:02] !log sbassett@deploy1001 Synchronized wmf-config/CommonSettings.php: Deploying temporary patch for T241503 (permissions clean-up) (duration: 00m 54s) [19:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:39] 10Operations, 10Traffic: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (10jcrespo) [19:58:41] 10Operations, 10Traffic, 10HTTPS, 10Voice & Tone: sec-warning page is Wikipedia-specific and dubiously worded - https://phabricator.wikimedia.org/T241656 (10jcrespo) [19:59:29] 10Operations, 10Traffic: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (10jcrespo) Feel free to edit the body with a complete list of changes, but I believe a single task would be enough to track all improvements requested. 
[20:08:53] (03PS1) 10MarcoAurelio: Revert "Temporarily add back old 'abusefilter-private(?:-log) permissions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561691 (https://phabricator.wikimedia.org/T241503) [20:12:44] (03CR) 10DannyS712: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561691 (https://phabricator.wikimedia.org/T241503) (owner: 10MarcoAurelio) [20:14:20] sbassett Reedy revert for T241503 should be ready [20:15:31] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993) (owner: 10BryanDavis) [20:18:23] (03PS7) 10Ottomata: New eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) [20:25:38] (03CR) 10SBassett: [C: 03+2] Revert "Temporarily add back old 'abusefilter-private(?:-log) permissions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561691 (https://phabricator.wikimedia.org/T241503) (owner: 10MarcoAurelio) [20:26:13] ...and now deploying the revert for T241503. 
[20:26:21] <3 [20:26:30] (03Merged) 10jenkins-bot: Revert "Temporarily add back old 'abusefilter-private(?:-log) permissions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561691 (https://phabricator.wikimedia.org/T241503) (owner: 10MarcoAurelio) [20:28:16] (03PS1) 10Ottomata: Clone new primary and secondary event schemas repos for eventschemas site [puppet] - 10https://gerrit.wikimedia.org/r/561693 (https://phabricator.wikimedia.org/T206789) [20:28:55] (03PS2) 10Ottomata: Clone new primary and secondary event schemas repos for eventschemas site [puppet] - 10https://gerrit.wikimedia.org/r/561693 (https://phabricator.wikimedia.org/T206789) [20:30:32] !log sbassett@deploy1001 Synchronized wmf-config/CommonSettings.php: Deploying revert of temporary patch for T241503 (permissions clean-up) (duration: 00m 53s) [20:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:31] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20154/schema1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/561693 (https://phabricator.wikimedia.org/T206789) (owner: 10Ottomata) [20:53:31] 10Operations, 10Cloud-Services, 10SDC General, 10Wikidata: Explore hosting the multimedia commons use case - https://phabricator.wikimedia.org/T152632 (10chasemp) 05Open→03Declined It seems this went nowhere, so I'm declining for now until movement. 
[20:53:53] 10Operations, 10Cloud-Services: maintain-meta_p hangs on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490 (10chasemp) 05Open→03Resolved a:03chasemp [20:57:03] (03PS2) 10Jhedden: lvs ceph: add cloudceph service and cluster [puppet] - 10https://gerrit.wikimedia.org/r/559110 (https://phabricator.wikimedia.org/T240715) [20:58:35] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10chasemp) [20:59:15] 10Operations, 10Mail: Get mail relay out of Yahoo! blacklist: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T58414 (10chasemp) 05Open→03Resolved a:03chasemp seems resolved from comments [21:00:54] 10Operations: Admin module should allow group management of system users - https://phabricator.wikimedia.org/T84279 (10chasemp) @Ottomata I think you found a way to do this or route around it? [21:01:05] (03CR) 10Jhedden: "Do I need to add discovery records in the operations/dns wmnet template for services outside of kubernetes?" [puppet] - 10https://gerrit.wikimedia.org/r/559110 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [21:04:54] 10Operations: Admin module should allow group management of system users - https://phabricator.wikimedia.org/T84279 (10Ottomata) 05Open→03Resolved a:03Ottomata We did! 
https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L288 [21:09:33] 10Operations, 10Data-Services, 10cloud-services-team (Kanban): evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991 (10chasemp) 05Open→03Declined [21:09:37] 10Operations, 10Data-Services, 10Tracking-Neverending, 10cloud-services-team (Kanban): overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083 (10chasemp) [21:14:23] 10Operations, 10Security-Team, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10chasemp) [21:16:25] 10Operations, 10User-jbond: Banning IPs / subnets from accessing login/validation endpoint - https://phabricator.wikimedia.org/T233945 (10chasemp) I wonder if this work supersedes {T224887} as the sole purpose of those rules is to ban IPs / subnets from access [21:16:41] 10Operations, 10Security-Team, 10User-jbond: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938 (10chasemp) [21:16:50] 10Operations, 10Security-Team, 10User-jbond: Validate Single Logout Flow - https://phabricator.wikimedia.org/T233941 (10chasemp) [21:16:59] 10Operations, 10Security-Team, 10User-jbond: Log / alert on too many failing logins / Throttling login attempts - https://phabricator.wikimedia.org/T233944 (10chasemp) [21:17:12] 10Operations, 10Security-Team, 10User-jbond: Maintain session history / audit log - https://phabricator.wikimedia.org/T233942 (10chasemp) [21:17:55] 10Operations, 10Security-Team, 10User-jbond: Banning IPs / subnets from accessing login/validation endpoint - https://phabricator.wikimedia.org/T233945 (10chasemp) [21:21:40] 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp) [21:23:21] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Ottomata) PSA! 
I've noticed that usages of envoyproxy for service TLS termination uses unencrypted private key files, but the cergen certificate manifests for these are c... [21:28:08] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10Bstorm) >>! In T210993#5769021, @bd808 wrote: > > @Bstorm are there existing NFS client dashboards that you use i... [21:31:26] 10Operations, 10ops-eqiad, 10Core Platform Team: rack/setup/install restbase1029, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10RobH) p:05Triage→03Normal [21:32:23] (03PS3) 10Ottomata: Enable envoyproxy tls for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/558660 (https://phabricator.wikimedia.org/T233630) [21:33:33] 10Operations, 10ops-codfw, 10Core Platform Team: rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10RobH) p:05Triage→03Normal [21:35:25] (03CR) 10Ottomata: [C: 03+2] Enable envoyproxy tls for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/558660 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [21:40:20] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Ottomata) Ah hm, I also just realized the public cert is manually committed to public puppet in files/ssl. Should we maybe just change sslcert::certificate to be smart(... 
[21:43:41] (03PS1) 10Ottomata: Add public cert for schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/561700 (https://phabricator.wikimedia.org/T233630) [21:44:11] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10RobH) p:05Triage→03Normal [21:48:04] (03PS2) 10Ottomata: Add public cert for schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/561700 (https://phabricator.wikimedia.org/T233630) [21:51:14] (03CR) 10Ottomata: [C: 03+2] Add public cert for schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/561700 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [21:53:22] (03PS3) 10BryanDavis: toolforge: Monitor local crontabs with Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993) [22:09:27] (03PS1) 10BryanDavis: support tools: Update Vagrantfile and run-image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/561701 [22:10:10] (03CR) 10BryanDavis: [C: 03+2] support tools: Update Vagrantfile and run-image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/561701 (owner: 10BryanDavis) [22:11:19] (03Merged) 10jenkins-bot: support tools: Update Vagrantfile and run-image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/561701 (owner: 10BryanDavis) [22:20:38] (03PS6) 10Ottomata: Set up cache routing for schema.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/549177 (https://phabricator.wikimedia.org/T233630) [22:23:48] (03CR) 10Ottomata: "The app will respond on 443 now, I think we can merge this one." 
[puppet] - 10https://gerrit.wikimedia.org/r/549177 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [22:27:42] (03Abandoned) 10BryanDavis: sudo: Allow root to assume any group [puppet] - 10https://gerrit.wikimedia.org/r/501043 (owner: 10BryanDavis) [22:28:07] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241534 (10Papaul) p:05Triage→03Normal [22:31:37] 10Operations, 10ops-codfw: codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) [22:41:04] 10Operations, 10ops-codfw, 10Wikimedia-Logstash: rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) [22:44:22] 10Operations, 10ops-eqiad, 10Dumps-Generation: rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10RobH) p:05Triage→03Normal [22:50:29] 10Operations, 10ops-eqiad: rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10RobH) p:05Triage→03Normal [23:02:07] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T241796 (10RobH) p:05Triage→03Normal [23:12:31] 10Operations, 10ops-eqiad, 10serviceops: rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10RobH) [23:16:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36720936 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:17:02] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 24357376 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:18:36] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 77291640 and 4 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:21:40] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 101205544 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:21:40] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3536 and 71 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:22:12] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 25240 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:22:26] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 46128 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:23:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 85760 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:26:49] (03PS1) 10BryanDavis: Add busybox to buster and stretch images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/561709 [23:32:32] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10wiki_willy) [23:33:40] (03CR) 10Bstorm: [C: 03+1] "My only wish is that it only appeared in the interactive shell containers. 
However, I think I'd like to put that on my wishlist for the w" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/561709 (owner: 10BryanDavis) [23:33:52] 10Operations, 10ops-eqiad, 10Core Platform Team: (No Need By Date) rack/setup/install restbase1029, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10wiki_willy) [23:34:37] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (No Need By Date) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10wiki_willy) [23:36:27] 10Operations, 10ops-eqiad, 10Dumps-Generation: (No Need By Date) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10wiki_willy) [23:36:49] 10Operations, 10ops-eqiad, 10DBA: (No Need By Date) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10wiki_willy) [23:37:37] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (No Need By Date) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10wiki_willy) [23:38:02] (03CR) 10BryanDavis: [C: 03+2] Add busybox to buster and stretch images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/561709 (owner: 10BryanDavis) [23:38:39] (03Merged) 10jenkins-bot: Add busybox to buster and stretch images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/561709 (owner: 10BryanDavis) [23:39:37] 10Operations, 10ops-codfw, 10serviceops: (Need By: Jan 15) rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T241796 (10wiki_willy) [23:40:30] 10Operations, 10ops-codfw, 10Wikimedia-Logstash: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10wiki_willy) [23:41:57] 10Operations, 10ops-codfw, 10DBA: (No Need By Date Provided) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10wiki_willy) [23:43:09] 
10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install mc203[7,8,9].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10wiki_willy) [23:43:55] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10wiki_willy)