[00:02:47] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:25:13] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 3 minutes ago with 11 failures. Failed resources (up to 3 shown): Service[cpjobqueue],Service[recommendation_api],Service[mobileapps],Service[mathoid] [00:56:07] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:34:27] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:49:55] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [02:58:11] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:29:03] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [03:31:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 917.99 seconds [03:59:13] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 261.15 seconds [04:10:22] (03PS7) 10Krinkle: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) [06:13:49] !log Deploy schema change on db1067 (s1) master - T86339 [06:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:53] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [06:15:40] !log Deploy schema change on s3 codfw master (db2043) with replication - T86339 [06:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:11] 10Operations, 
10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set up a test node with new version, Redis as cache, a new Swift container and export metrics over Fraphana - https://phabricator.wikimedia.org/T210076 (10Joe) As far as redis goes - do we need to replicate cross-dc? if not, we c... [06:20:20] (03PS1) 10Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339) [06:26:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339) (owner: 10Marostegui) [06:27:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339) (owner: 10Marostegui) [06:29:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 - T86339 (duration: 00m 48s) [06:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:20] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [06:31:32] !log Deploy schema change dbstore1002:s3 - T86339 [06:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339) (owner: 10Marostegui) [06:32:40] !log Deploy schema change db1123 - T86339 [06:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475436 [06:50:40] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) [06:51:01] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: 
rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) @Cmjohnson reminder: this is RAID5 instead of 10 as noted on top of the task. [06:57:29] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475436 (owner: 10Marostegui) [06:58:32] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475436 (owner: 10Marostegui) [06:58:46] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475436 (owner: 10Marostegui) [06:59:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 - T86339 (duration: 00m 49s) [06:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:50] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [07:00:31] !log Deploy schema change db1095 - T86339 [07:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339) [07:12:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339) (owner: 10Marostegui) [07:13:19] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339) (owner: 10Marostegui) [07:14:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 - T86339 (duration: 00m 45s) [07:14:21] !log Deploy schema change db1078 - T86339 [07:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:23] T86339: Dropping site_stats.ss_total_views on wmf databases - 
https://phabricator.wikimedia.org/T86339 [07:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:06] 10Operations, 10DBA, 10monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427 (10Marostegui) Just to be on the safe side I have created Calendar events (personals and on Ops maintenance) as a reminder: 1 year before expiration 6 months before expiration 3... [07:25:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339) (owner: 10Marostegui) [07:29:13] (03CR) 10Urbanecm: [C: 031] "Only safe if the namespace is empty." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475367 (https://phabricator.wikimedia.org/T210171) (owner: 10Zoranzoki21) [07:38:47] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475439 [07:47:16] (03CR) 10Alexandros Kosiaris: "Again +1 from me" [puppet] - 10https://gerrit.wikimedia.org/r/396055 (https://phabricator.wikimedia.org/T182249) (owner: 10Awight) [07:48:42] !log installing libtirpc security updates [07:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:22] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce zoterov2 LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/473733 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [07:59:29] (03PS4) 10Alexandros Kosiaris: Introduce zoterov2 LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/473733 (https://phabricator.wikimedia.org/T201611) [07:59:32] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Introduce zoterov2 LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/473733 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [08:04:04] (03PS1) 10Gehel: wdqs: disable test queries on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/475440 
(https://phabricator.wikimedia.org/T207665) [08:05:42] (03CR) 10Mathew.onipe: [C: 031] wdqs: disable test queries on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/475440 (https://phabricator.wikimedia.org/T207665) (owner: 10Gehel) [08:05:48] 10Operations, 10Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (10MoritzMuehlenhoff) The Icinga servers in production are now running 0.9-1~bpo9+1, but the Cron job still needs to be re-instated. [08:05:53] (03CR) 10Gehel: [C: 032] wdqs: disable test queries on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/475440 (https://phabricator.wikimedia.org/T207665) (owner: 10Gehel) [08:06:58] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 42 connections established with conf1004.eqiad.wmnet:4001 (min=43) [08:08:44] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475439 (owner: 10Marostegui) [08:09:16] (03PS1) 10Muehlenhoff: Record deskana as LDAP user [puppet] - 10https://gerrit.wikimedia.org/r/475441 [08:09:36] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475439 (owner: 10Marostegui) [08:09:58] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 31 connections established with conf2001.codfw.wmnet:2379 (min=32) [08:10:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 - T86339 (duration: 00m 46s) [08:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:47] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [08:11:01] (03PS2) 10Muehlenhoff: Record deskana as LDAP user [puppet] - 10https://gerrit.wikimedia.org/r/475441 [08:12:06] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 43 connections established with conf1004.eqiad.wmnet:4001 (min=43) [08:12:09] (03PS25) 10Mathew.onipe: 
elasticsearch_cluster: Added multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [08:12:18] (03CR) 10Muehlenhoff: [C: 032] Record deskana as LDAP user [puppet] - 10https://gerrit.wikimedia.org/r/475441 (owner: 10Muehlenhoff) [08:12:32] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 42 connections established with conf1004.eqiad.wmnet:4001 (min=43) [08:12:33] (03CR) 10Mathew.onipe: elasticsearch_cluster: Added multi-cluster/multi-instance support (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [08:18:33] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475439 (owner: 10Marostegui) [08:29:56] pybal complaining was me [08:32:44] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 43 connections established with conf1004.eqiad.wmnet:4001 (min=43) [08:35:10] (03CR) 10DCausse: [C: 031] elasticsearch_cluster: Added multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [08:35:10] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 31 connections established with conf2001.codfw.wmnet:2379 (min=32) [08:35:29] 10Operations, 10Wikimedia-Mailing-lists: Post hold because of "invalid headers" in wikimediacz-l - https://phabricator.wikimedia.org/T210223 (10Urbanecm) [08:37:24] PROBLEM - Confd template for /srv/config-master/pybal/codfw/zoterov2 on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/zoterov2 [08:37:57] fixing ^ [08:38:14] (03PS1) 10Alexandros Kosiaris: conftool: Add forgotten zotero codfw data [puppet] - 10https://gerrit.wikimedia.org/r/475442 [08:39:00] (03CR) 10Alexandros Kosiaris: [C: 032] conftool: Add forgotten zotero codfw 
data [puppet] - 10https://gerrit.wikimedia.org/r/475442 (owner: 10Alexandros Kosiaris) [08:40:20] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 32 connections established with conf2001.codfw.wmnet:2379 (min=32) [08:40:50] RECOVERY - Confd template for /srv/config-master/pybal/codfw/zoterov2 on puppetmaster2001 is OK: No errors detected [08:41:01] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=.*,service=zotero,cluster=kubernetes,name=.* [08:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:51] (03PS1) 10Muehlenhoff: Remove Diamond from several analytics roles [puppet] - 10https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454) [08:44:18] akosiaris: so the alerts are because of zoterov2 being added, right? Did you restart pybal to fix them? [08:44:28] yup [08:44:31] k [08:44:33] all fixed [08:45:27] all but lvs2006? [08:45:34] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 32 connections established with conf2001.codfw.wmnet:2379 (min=32) [08:45:37] :-) [08:45:37] hehe [08:45:53] just icinga being slow [08:49:27] (03CR) 10Elukey: [C: 031] Remove Diamond from several analytics roles [puppet] - 10https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [08:59:54] (03PS2) 10Muehlenhoff: Remove Diamond from several analytics roles [puppet] - 10https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454) [09:00:39] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond from several analytics roles [puppet] - 10https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [09:05:01] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:06:09] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational [09:07:21] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10akosiaris) [09:07:25] 10Operations, 10Citoid, 10Patch-For-Review, 10Service-deployment-requests, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10akosiaris) 05Open>03Resolved Finally deployed to production Things to note: * zotero is configured to use the same outgoing proxy (url-do... [09:15:48] (03PS2) 10Volans: netbox: allow read-only access to wmf ldap group [puppet] - 10https://gerrit.wikimedia.org/r/475295 (https://phabricator.wikimedia.org/T208267) [09:17:31] (03CR) 10Volans: [C: 032] netbox: allow read-only access to wmf ldap group [puppet] - 10https://gerrit.wikimedia.org/r/475295 (https://phabricator.wikimedia.org/T208267) (owner: 10Volans) [09:26:40] (03PS2) 10Filippo Giunchedi: WIP rsyslog: udp input json_lines shim [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) [09:29:50] (03PS1) 10Volans: netbox: notify the uwsgi app on config change [puppet] - 10https://gerrit.wikimedia.org/r/475449 (https://phabricator.wikimedia.org/T208267) [09:37:54] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (10Volans) Added read-only access to `cn=wmf` and confirmed it works as expected allowing people to login but in read-only mode. Edit/delete/add... 
[10:00:19] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/475449 (https://phabricator.wikimedia.org/T208267) (owner: 10Volans) [10:13:08] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) A first iteration on this might look like https://gerrit.wikimedia.org/r/475352 in which a new udp listener is added on localho... [10:16:27] (03CR) 10Volans: [C: 032] netbox: notify the uwsgi app on config change [puppet] - 10https://gerrit.wikimedia.org/r/475449 (https://phabricator.wikimedia.org/T208267) (owner: 10Volans) [10:33:00] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (10Volans) 05Open>03Resolved [10:50:02] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Banyek) As I plan to get involved this I read back the ticket, and put the following tldr together. Did I... 
[11:01:54] (03PS1) 10Alex Monk: Revert "certspotter: temporarily disable cron job" [puppet] - 10https://gerrit.wikimedia.org/r/475453 (https://phabricator.wikimedia.org/T204993) [11:10:09] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [11:12:42] (03PS1) 10Elukey: profile::hadoop::spark2: get security settings via hiera [puppet] - 10https://gerrit.wikimedia.org/r/475454 [11:19:27] (03PS2) 10Elukey: profile::hadoop::spark2: get security settings via hiera [puppet] - 10https://gerrit.wikimedia.org/r/475454 [11:19:29] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13689/" [puppet] - 10https://gerrit.wikimedia.org/r/475454 (owner: 10Elukey) [11:22:41] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) hey @Bstorm any suggestions to handle labstore1006 and labstore1007 reboots? Those are for Dumps, right? cc @ArielGlenn [11:28:21] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [11:29:31] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [11:34:01] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:34:05] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:34:09] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:34:51] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:34:55] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:35:01] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:35:02] * gehel is looking at wdqs [11:35:57] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:35:59] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:36:07] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:36:15] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:36:19] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:36:48] updater failing on all wdqs nodes [11:37:13] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational [11:37:29] !log restarting updater on all wdqs nodes [11:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:35] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational [11:38:15] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [11:38:16] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational [11:38:19] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational [11:38:25] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational [11:38:27] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational [11:38:33] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [11:38:37] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational [11:38:39] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational [11:38:41] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [11:39:46] lot of "Weird reference" in the logs, updater seems to crash and restart, but not much in logs [11:41:37] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:42:59] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:43:44] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (10Gehel) [11:44:05] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:07] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:13] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:15] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:17] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:21] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:27] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:31] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:34] gehel: can I help? [11:44:35] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:42] restarting updater does not seem to be sufficient, it is probably getting data from wikidata that it does not understand [11:44:59] anyone knows of something that would have changed on the wikidata side? [11:45:15] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:45:35] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:45:36] volans: not an emergency yet (updater is an async process, so we'll have some lag, not good but not critical) [11:46:10] volans: if you could silence that icinga systemd check for all wdqs nodes while I dig in the logs, that would be great! [11:46:21] gehel: sure, on it [11:46:26] thanks! [11:46:37] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational [11:47:50] looks like some bot uploading some problematic data about fish [11:47:52] gehel: how long the downtime? [11:47:55] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [11:48:05] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10ArielGlenn) They are indeed for dumps; one is set up to be the web server, and a reboot of that means that the broader public will notice. The other provides NFS service of dumps to stat100... [11:48:05] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational [11:48:26] volans: 4 hours at the moment, I'll either remove or extend [11:48:30] ack [11:48:33] I thought the same [11:48:37] wat? now it is recovering? [11:49:09] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:49:21] na... it is going to die again [11:49:48] {done} on wdqs[12]* [11:49:54] only for the systemd check [11:50:01] yep, that's good [11:50:12] we might get an update lag alert soon [11:50:25] they are warning right now [11:50:53] volans: can you silence them as well? no need to add noise, we know that something is wrong already [11:51:21] also the updater process? 
[11:51:27] yep [11:52:13] ack [11:52:31] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational [11:53:21] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [11:53:41] eh, failed alert before the downtime will spam with the recovery anyway [11:54:18] ack [11:54:33] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational [11:54:36] logs are useless, how can the updater crash without a message? [11:54:53] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [11:55:17] Oh, out of memory! [11:55:43] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational [11:56:36] gehel: any reason why high lag check has disabled notifications on wdqs2003? [11:57:01] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [11:57:02] volans: who disabled it and is there a comment? [11:57:08] (probably me) [11:57:17] no good reason that I know of [11:57:31] who I don't know, there is an old message referring to T207817 but is on multiple host [11:57:31] T207817: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 [11:57:34] *hosts [11:57:42] this is the only one with the check disabled [11:57:43] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (10Gehel) Out of memory error on a bind: ` Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space Nov 23... [11:57:49] I'll re-enable if it's ok for you [11:57:53] yep [12:00:20] and of course, heap size isn't configurable via puppet atm [12:00:31] if I cannot help with much more, I'd go for lunch [12:00:31] ofc [12:00:50] volans: yep, enjoy the lunch, I'm on it, we're all good! [12:00:55] ack, thanks! 
[12:00:57] and thanks for the help! [12:01:05] bon appétit [12:01:22] merci [12:04:32] !log manually increasing wdqs-updater heap to 4G - T210235 [12:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:36] T210235: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 [12:04:45] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational [12:04:45] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational [12:04:51] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational [12:05:07] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational [12:05:41] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational [12:07:33] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (10Gehel) increasing heap stops the updater from crashing, but blazegraph refuses updates > 200M (we probably don't want to increase this limit) [12:11:12] (03PS1) 10Gehel: wdqs: reduce batch size to 300 temporarily [puppet] - 10https://gerrit.wikimedia.org/r/475455 (https://phabricator.wikimedia.org/T210235) [12:13:32] (03PS2) 10Gehel: wdqs: reduce batch size to 300 temporarily [puppet] - 10https://gerrit.wikimedia.org/r/475455 (https://phabricator.wikimedia.org/T210235) [12:14:15] (03CR) 10Gehel: [C: 032] wdqs: reduce batch size to 300 temporarily [puppet] - 10https://gerrit.wikimedia.org/r/475455 (https://phabricator.wikimedia.org/T210235) (owner: 10Gehel) [12:14:26] (03PS1) 10Muehlenhoff: Remove Diamond from Ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/475456 [12:15:01] (03CR) 10jerkins-bot: [V: 04-1] Remove Diamond from Ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/475456 (owner: 10Muehlenhoff) [12:16:13] (03PS1) 10Giuseppe 
Lavagetto: mediawiki::appserver: add php monitoring [puppet] - 10https://gerrit.wikimedia.org/r/475457 (https://phabricator.wikimedia.org/T209573) [12:17:58] (03PS2) 10Muehlenhoff: Remove Diamond from Ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/475456 (https://phabricator.wikimedia.org/T183454) [12:20:06] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (10Gehel) Reducing batch size seems to work, updates are processed again. [12:21:32] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/13690/mw1261.eqiad.wmnet/ the patch is correct in puppet terms, I'd like someone to take " [puppet] - 10https://gerrit.wikimedia.org/r/475457 (https://phabricator.wikimedia.org/T209573) (owner: 10Giuseppe Lavagetto) [12:21:50] (03CR) 10Alexandros Kosiaris: [C: 031] "NOOP per https://puppet-compiler.wmflabs.org/compiler1002/13688/, let's merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/475258 (owner: 10Dzahn) [12:39:05] 10Operations, 10Analytics-Kanban, 10Discovery, 10Product-Analytics, and 3 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682 (10Liuxinyu970226) [12:39:08] (03CR) 10Jcrespo: [C: 04-1] "This is wrong, you are removing hosts from being spare." [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [12:40:48] (03CR) 10Banyek: "> This is wrong, you are removing hosts from being spare." 
[puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [12:43:33] (03CR) 10Jcrespo: [C: 04-1] "You are removing dbproxy1012,13,14,17 from site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [12:44:48] (03CR) 10Banyek: "> You are removing dbproxy1012,13,14,17 from site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [12:51:48] 10Operations, 10ORES, 10Scoring-platform-team (Current): Investigate memory usage of ORES in kubernetes - https://phabricator.wikimedia.org/T210264 (10Ladsgroup) [12:54:13] (03PS3) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [13:01:09] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10Ladsgroup) [13:02:18] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10Ladsgroup) p:05Triage>03Low [13:02:43] 10Operations, 10ORES, 10Scoring-platform-team: Build helm charts for ORES - https://phabricator.wikimedia.org/T210269 (10Ladsgroup) [13:18:48] (03PS1) 10Muehlenhoff: Add mapped IPv6 to labmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/475461 [13:18:50] (03PS1) 10Muehlenhoff: Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 [13:22:07] (03CR) 10Arturo Borrero Gonzalez: [C: 031] Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 (owner: 10Muehlenhoff) [13:32:36] !log installing confuse security updates [13:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:55] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: 
Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [13:35:43] (03PS1) 10Muehlenhoff: Add library hint for confuse [puppet] - 10https://gerrit.wikimedia.org/r/475465 [13:36:47] (03PS2) 10Muehlenhoff: Add library hint for confuse [puppet] - 10https://gerrit.wikimedia.org/r/475465 [13:38:18] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway): CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) Part of the project to migrate all CI jobs to Docker containers (#ci-slipway ) [13:38:23] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [13:41:04] I wonder what we use `confuse` for [13:42:27] (03CR) 10Muehlenhoff: [C: 032] Add library hint for confuse [puppet] - 10https://gerrit.wikimedia.org/r/475465 (owner: 10Muehlenhoff) [13:47:23] (03CR) 10Filippo Giunchedi: [C: 031] Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 (owner: 10Muehlenhoff) [13:48:54] (03CR) 10Filippo Giunchedi: [C: 031] "Thanks for working on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/475461 (owner: 10Muehlenhoff) [13:53:31] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 406, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:59:31] 10Operations: Upgrade Ganeti clusters to 2.15.2-7+deb9u3 - https://phabricator.wikimedia.org/T210289 (10MoritzMuehlenhoff) [14:00:23] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: make wdqs-updater heap size configurable from puppet - https://phabricator.wikimedia.org/T210290 (10Gehel) [14:02:23] !log restor wdqs-updater heap to 2G - T210235 [14:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:27] T210235: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 [14:14:08] (03CR) 10Elukey: [C: 032] "Andrew: going to merge this to start testing some settings in labs, I can amend/revert/etc.. if you don't like it :)" [puppet] - 10https://gerrit.wikimedia.org/r/475454 (owner: 10Elukey) [14:14:14] (03PS3) 10Elukey: profile::hadoop::spark2: get security settings via hiera [puppet] - 10https://gerrit.wikimedia.org/r/475454 [14:27:19] (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050) [14:27:21] (03PS1) 10Vgutierrez: tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) [14:28:21] (03CR) 10Alex Monk: [C: 031] certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:28:33] (03CR) 10Alex Monk: [C: 031] tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:29:38] vgutierrez, they've all been apache2 rather than nginx so 
far right? [14:29:46] indeed [14:32:11] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:32:19] (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050) [14:33:56] vgutierrez, btw I tried to clean up the phab task structure and close some tickets we're pretty much done with [14:34:01] hopefully I didn't muck up anything too badly [14:34:18] I don't think so.. that's for taking care of that :) [14:34:33] s/that's/thanks/ [14:34:34] damn OS X... [14:35:16] at first I was cautious and then I was like eh... it should be uncontroversial and easily-revertible and pretty much bikeshedding, so [14:35:24] yay closed tickets [14:37:07] (03PS2) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) [14:37:09] (03PS1) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [14:38:21] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:48:53] (03CR) 10Jcrespo: [C: 031] tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:49:36] (03CR) 10Vgutierrez: [C: 032] tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:49:44] (03PS2) 10Vgutierrez: tendril: Deploy certcentral managed TLS certificate [puppet] - 
10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) [14:58:51] (03PS2) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [14:59:33] (03PS1) 10Vgutierrez: tendril: Use the cercentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475478 (https://phabricator.wikimedia.org/T207050) [14:59:50] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [15:04:38] (03PS1) 10Hashar: tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479 [15:05:29] (03CR) 10jerkins-bot: [V: 04-1] tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479 (owner: 10Hashar) [15:10:49] herron, about? I just noticed a wikimedia wiki password reset email got DKIM fail from google. I don't think that's expected? [15:11:04] (03PS1) 10Hashar: Fix invalid escape sequence in a regular expression [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475480 [15:11:28] (03PS2) 10Hashar: tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479 [15:12:14] idle 68h, nope [15:13:08] (03CR) 10Vgutierrez: [C: 032] tendril: Use the cercentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475478 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:13:17] (03PS2) 10Vgutierrez: tendril: Use the cercentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475478 (https://phabricator.wikimedia.org/T207050) [15:13:35] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:13:42] guess I should file a ticket [15:14:44] 10Operations, 10Traffic: ATS path normalization - https://phabricator.wikimedia.org/T210295 (10ema) p:05Triage>03Normal [15:17:19] (03PS1) 10Vgutierrez: tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050) [15:19:34] 10Operations, 10Traffic: ATS path normalization - https://phabricator.wikimedia.org/T210295 (10ema) [15:21:29] (03Abandoned) 10Giuseppe Lavagetto: Revert "Unvendorize wherever possible" [debs/prometheus-php-fpm-exporter] - 10https://gerrit.wikimedia.org/r/475075 (owner: 10Giuseppe Lavagetto) [15:21:44] (03PS3) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) [15:21:46] (03PS3) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [15:21:54] (03PS1) 10Vgutierrez: librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050) [15:21:57] (03PS1) 10Vgutierrez: netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050) [15:21:59] (03Abandoned) 10Giuseppe Lavagetto: site.pp: move slave redises to system::spare [puppet] - 10https://gerrit.wikimedia.org/r/443789 (https://phabricator.wikimedia.org/T198220) (owner: 10Giuseppe Lavagetto) [15:22:48] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: move mediawiki::web to a profile [puppet] - 10https://gerrit.wikimedia.org/r/395715 (owner: 10Giuseppe Lavagetto) [15:22:55] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 
10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [15:23:57] (03Abandoned) 10Giuseppe Lavagetto: profile::mediawiki::web: explicitly set log retention days [puppet] - 10https://gerrit.wikimedia.org/r/395716 (owner: 10Giuseppe Lavagetto) [15:24:58] (03PS4) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [15:26:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [15:30:41] (03CR) 10Alex Monk: [C: 031] librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:30:43] (03CR) 10Alex Monk: [C: 031] netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:33:48] 10Operations, 10Release-Engineering-Team (Backlog): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10faidon) >>! In T203003#4768791, @hashar wrote: > I guess we can close rKEYHOLDER. Seems to me keyholder code will be moved out of operations/puppet to `operations/software/keyho... 
[15:34:55] (03CR) 10Alex Monk: [C: 031] tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:35:52] (03CR) 10Vgutierrez: [C: 032] tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:36:07] (03PS2) 10Vgutierrez: tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050) [15:38:45] (03CR) 10Vgutierrez: [C: 032] librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:38:52] (03PS2) 10Vgutierrez: librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050) [15:39:17] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:40:21] (03CR) 10Vgutierrez: [C: 032] netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:40:30] (03PS2) 10Vgutierrez: netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050) [15:40:56] 10Operations, 10Traffic, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez) [15:41:33] (03PS4) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) [15:41:35] (03PS5) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump 
[puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [15:41:45] (03PS1) 10Gehel: Revert "wdqs: reduce batch size to 300 temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/475484 [15:42:06] (03PS2) 10Gehel: Revert "wdqs: reduce batch size to 300 temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/475484 [15:42:23] (03PS3) 10Muehlenhoff: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 [15:42:45] (03CR) 10Gehel: [C: 032] Revert "wdqs: reduce batch size to 300 temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/475484 (owner: 10Gehel) [15:45:13] (03PS3) 10Vgutierrez: netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050) [16:03:15] 10Operations, 10User-Elukey: Apply interface::rps to all the mc hosts - https://phabricator.wikimedia.org/T209489 (10elukey) @BBlack do you have any suggestion about things to check / precautions to take before enabling interface::rps to all the memcached shards? I am asking since it is a very delicate piece o... 
[16:05:05] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aboubacarkhoraa) [16:12:53] (03PS1) 10Ladsgroup: Refactor ORES uWSGI workers to use an absolute count [puppet] - 10https://gerrit.wikimedia.org/r/475487 (https://phabricator.wikimedia.org/T182249) [16:15:59] (03Abandoned) 10Ladsgroup: Refactor ORES uWSGI workers to use an absolute count [puppet] - 10https://gerrit.wikimedia.org/r/475487 (https://phabricator.wikimedia.org/T182249) (owner: 10Ladsgroup) [16:16:44] 10Operations, 10Traffic: ATS path normalization - https://phabricator.wikimedia.org/T210295 (10ema) I now see that @BBlack prepped https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/407643/ to bring reality closer to theory by adding the missing characters to `{mediawiki,restbase}_encode` -- with a caveat... [16:18:35] 10Operations, 10Traffic, 10Patch-For-Review: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) [16:24:17] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [16:24:35] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:26:24] 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) Downtimes set, we shouldn't get cert alerts in icinga [16:26:43] (03PS1) 10BBlack: Remove old globalsign unified certs from config [puppet] - 10https://gerrit.wikimedia.org/r/475489 (https://phabricator.wikimedia.org/T206804) [16:26:45] (03PS1) 10BBlack: Remove old globalsign unified cert files [puppet] - 10https://gerrit.wikimedia.org/r/475490 (https://phabricator.wikimedia.org/T206804) [16:26:53] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [16:27:43] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [16:27:56] (03CR) 10BBlack: [C: 032] Remove old globalsign unified certs from config [puppet] - 10https://gerrit.wikimedia.org/r/475489 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack) [16:28:34] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10MoritzMuehlenhoff) Status u... [16:32:29] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) This was already a... 
[16:38:27] (03CR) 10BBlack: [C: 032] Remove old globalsign unified cert files [puppet] - 10https://gerrit.wikimedia.org/r/475490 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack) [16:50:43] (03PS1) 10Giuseppe Lavagetto: wmflib: make the role() function store a path in $::_role [puppet] - 10https://gerrit.wikimedia.org/r/475498 [16:50:45] (03PS1) 10Giuseppe Lavagetto: hiera: remove the role backend in production [puppet] - 10https://gerrit.wikimedia.org/r/475499 [16:50:47] (03PS1) 10Giuseppe Lavagetto: hiera: fix the hierarchical order of lookups [puppet] - 10https://gerrit.wikimedia.org/r/475500 [16:53:42] !log cleaned up remnants of globalsign-2017 unified cert (OCSP cache/config, unmanaged cert files, etc) on all cpNNNN - T206804 [16:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:46] T206804: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 [16:55:36] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aklapper) a:05WMEE-Phabricator-List>03None [Please do not set random assignees](https://www.mediawiki.org/wiki/How_to_report_a_bug) who cannot work on this. Thanks. :) [16:57:49] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aklapper) @aboubacarkhoraa: Could you explain what the "GIN" in "Wikimedia-GIN" means? Also see https://meta.wikimedia.org/wiki/Mailing_lists/Standardization . Thanks! 
[16:58:33] (03PS6) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [17:01:24] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Krenair) appears to be the ISO 3166-1 alpha-3 code for Guinea [17:06:19] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aboubacarkhoraa) As Krenair said, GIN is the code for Guinea, which received its affiliation yesterday, and we would like to create a mailing list for the group. [17:17:48] (03PS7) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [17:29:16] (03PS8) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [18:42:56] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aklapper) Ah, thanks for explaining! I only knew "GN".
[19:16:48] (03PS1) 10Ema: ATS: path normalization [puppet] - 10https://gerrit.wikimedia.org/r/475508 (https://phabricator.wikimedia.org/T210295) [19:56:58] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aboubacarkhoraa) Yes, usually it is GN, but the code the affiliations committee gave me is GIN, which is why I am adding it here [21:12:04] * bawolff deploying a security thing [21:15:19] !log deploy patch for T210192 [21:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:23] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [22:39:33] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.023 second response time