[00:01:36] (03CR) 10BryanDavis: [C: 04-1] "Its a start, but not quite there yet." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: 10Nehajha) [00:43:24] (03PS1) 10Esanders: Enable mobile section editing on bnwiki, hewiki, zh_yuewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496696 (https://phabricator.wikimedia.org/T218375) [04:00:04] kart_: #bothumor My software never has bugs. It just develops random features. Rise for . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190315T0400). [04:01:33] !log Started manual run of unpublished ContentTranslation draft purge script (T217818) [04:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:37] T217818: Run unpublished draft purge script for CX (Week of 03/10) - https://phabricator.wikimedia.org/T217818 [04:40:26] (03PS1) 10Mholloway: Update production SSH key for Michael Holloway [puppet] - 10https://gerrit.wikimedia.org/r/496706 [06:01:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496712 [06:02:12] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496712 (owner: 10Marostegui) [06:03:11] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496712 (owner: 10Marostegui) [06:04:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 (duration: 00m 50s) [06:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:23] !log Upgrade db1091 [06:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496712 (owner: 10Marostegui) [06:16:09] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/496713 [06:31:41] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/aphlict] [06:37:35] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > After talking with Stas this apparently makes updating within the updater harder etc as it might result in more wr... [06:40:54] (03PS1) 10Jcrespo: backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) [06:41:19] (03CR) 10jerkins-bot: [V: 04-1] backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [06:42:05] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496713 (owner: 10Marostegui) [06:43:03] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496713 (owner: 10Marostegui) [06:43:16] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496713 (owner: 10Marostegui) [06:44:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1091 (duration: 00m 48s) [06:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:44] (03PS2) 10Jcrespo: backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) [06:45:08] (03CR) 10jerkins-bot: [V: 04-1] backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 
(https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [06:51:44] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496715 [06:52:52] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496715 (owner: 10Marostegui) [06:53:49] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496715 (owner: 10Marostegui) [06:54:37] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496715 (owner: 10Marostegui) [06:55:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1091 (duration: 00m 48s) [06:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:41] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:00:13] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496716 [07:01:44] !log Finished manual run of unpublished ContentTranslation draft purge script (T217818) [07:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:47] T217818: Run unpublished draft purge script for CX (Week of 03/10) - https://phabricator.wikimedia.org/T217818 [07:03:09] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496716 (owner: 10Marostegui) [07:04:06] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496716 (owner: 10Marostegui) [07:05:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1091 (duration: 00m 48s) [07:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:47] (03CR) 10jenkins-bot: 
db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496716 (owner: 10Marostegui) [07:07:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] osmdb: Switch the replica to the VM that needs to become the master [puppet] - 10https://gerrit.wikimedia.org/r/495290 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [07:10:25] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496717 [07:15:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496717 (owner: 10Marostegui) [07:16:04] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496717 (owner: 10Marostegui) [07:17:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1091 (duration: 00m 48s) [07:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:17] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496717 (owner: 10Marostegui) [07:20:01] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496718 [07:26:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496718 (owner: 10Marostegui) [07:27:05] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496718 (owner: 10Marostegui) [07:28:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1091 (duration: 00m 49s) [07:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:22] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) [07:28:34] (03CR) 
10jenkins-bot: db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496718 (owner: 10Marostegui) [07:29:18] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:31:30] (03PS3) 10Jcrespo: backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) [07:31:53] (03CR) 10jerkins-bot: [V: 04-1] backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [07:32:23] (03PS1) 10Marostegui: mariadb: Failover db1066 to db1076 on s2 [puppet] - 10https://gerrit.wikimedia.org/r/496720 (https://phabricator.wikimedia.org/T187960) [07:32:42] (03CR) 10Marostegui: [C: 04-2] "Do not submit unless it is necessary" [puppet] - 10https://gerrit.wikimedia.org/r/496720 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [07:33:16] (03PS1) 10Marostegui: db-eqiad.php: Failover db1066 to db1076 on s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496721 (https://phabricator.wikimedia.org/T187960) [07:33:46] (03CR) 10Marostegui: [C: 04-2] "Do not submit unless it is necessary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496721 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [07:36:04] (03PS4) 10Jcrespo: backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) [07:36:26] (03CR) 10jerkins-bot: [V: 04-1] backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [07:40:31] (03PS24) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and 
provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [07:41:01] 10Operations, 10Analytics, 10EventBus, 10Prod-Kubernetes, and 2 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10akosiaris) I can not rule out a networking issue but it seems improbable, as after all the logs did make it to logstash. Al... [07:41:23] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [07:45:04] (03PS5) 10Jcrespo: backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) [07:45:43] (03CR) 10jerkins-bot: [V: 04-1] backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [07:50:51] (03PS25) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [07:51:47] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [07:52:00] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team, 10Traffic: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10Jdrewniak) [07:52:19] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) [07:52:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (034 
comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [07:53:12] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:55:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [07:59:46] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) [08:00:42] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:03:18] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) [08:04:13] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:11:51] (03PS1) 10Marostegui: mariadb: Promote db1120 as x1 master [puppet] - 10https://gerrit.wikimedia.org/r/496723 (https://phabricator.wikimedia.org/T187960) [08:12:48] (03PS1) 10Marostegui: db-eqiad.php: Promote db1120 as x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496724 (https://phabricator.wikimedia.org/T187960) [08:12:59] (03CR) 10Marostegui: [C: 04-2] "Do not submit unless it is necessary" [puppet] - 10https://gerrit.wikimedia.org/r/496723 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [08:13:24] (03CR) 10Marostegui: [C: 04-2] "Do not submit unless it is 
necessary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496724 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [08:17:13] (03PS1) 10Elukey: Add Yarn TLS settings to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/496725 (https://phabricator.wikimedia.org/T217412) [08:17:49] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496726 [08:17:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet_alert: Email projectadmins instead of members [puppet] - 10https://gerrit.wikimedia.org/r/495757 (https://phabricator.wikimedia.org/T218009) (owner: 10Alex Monk) [08:19:12] (03PS2) 10Elukey: Add Yarn TLS settings to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/496725 (https://phabricator.wikimedia.org/T217412) [08:19:57] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496726 (owner: 10Marostegui) [08:20:58] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496726 (owner: 10Marostegui) [08:22:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 00m 49s) [08:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:12] (03CR) 10Marostegui: [C: 03+1] "I like the idea! Makes lots of sense, specially deleting only the same section ones." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [08:23:46] (03CR) 10Elukey: [C: 03+2] Add Yarn TLS settings to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/496725 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [08:25:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496726 (owner: 10Marostegui) [08:32:08] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::standby: set Yarn TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/496727 (https://phabricator.wikimedia.org/T217412) [08:32:48] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::standby: set Yarn TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/496727 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [08:34:08] (03CR) 10Gehel: [C: 04-1] "A bunch of style issues, but the principles are good!" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [08:34:19] (03CR) 10Gehel: Add cookbook for elastic6 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [08:36:12] (03PS2) 10Dzahn: monitoring: link to dcops runbook page for mgmt interface checks [puppet] - 10https://gerrit.wikimedia.org/r/496438 [08:37:19] (03CR) 10Dzahn: [C: 03+2] "per IRC" [puppet] - 10https://gerrit.wikimedia.org/r/496438 (owner: 10Dzahn) [08:38:14] (03PS2) 10Dzahn: graphite: add Icinga notes_url for graphite_freshness check [puppet] - 10https://gerrit.wikimedia.org/r/496437 [08:41:08] (03CR) 10Dzahn: [C: 03+2] graphite: add Icinga notes_url for graphite_freshness check [puppet] - 10https://gerrit.wikimedia.org/r/496437 (owner: 10Dzahn) [08:47:02] (03PS1) 10Marostegui: monitor_replication: Change notes_url link [puppet] - 10https://gerrit.wikimedia.org/r/496728 [08:47:29] (03PS2) 10Marostegui: 
monitor_replication: Change notes_url link [puppet] - 10https://gerrit.wikimedia.org/r/496728 [08:48:24] (03CR) 10Marostegui: [C: 03+2] monitor_replication: Change notes_url link [puppet] - 10https://gerrit.wikimedia.org/r/496728 (owner: 10Marostegui) [08:48:59] (03CR) 10Dzahn: Pass flag use_nodejs10 for maps services (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [08:49:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496729 [08:50:20] (03PS2) 10Dzahn: DNS: remove mgmt DNS name for rdb200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/496504 (owner: 10Papaul) [08:50:30] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496729 (owner: 10Marostegui) [08:51:35] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496729 (owner: 10Marostegui) [08:51:49] (03CR) 10Dzahn: [C: 03+2] DNS: remove mgmt DNS name for rdb200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/496504 (owner: 10Papaul) [08:52:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 (duration: 00m 47s) [08:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496730 [08:55:37] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496730 (owner: 10Marostegui) [08:56:37] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496730 (owner: 10Marostegui) [08:57:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103 (duration: 00m 50s) [08:57:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:19] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496729 (owner: 10Marostegui) [08:59:21] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496730 (owner: 10Marostegui) [08:59:40] (03Abandoned) 10Gehel: elasticsearch: fixed duplicated check description [puppet] - 10https://gerrit.wikimedia.org/r/488453 (https://phabricator.wikimedia.org/T212850) (owner: 10Gehel) [09:01:14] (03PS1) 10Dzahn: cache::ssl::unified: Icinga link to HTTPS on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496731 [09:02:13] (03CR) 10jerkins-bot: [V: 04-1] cache::ssl::unified: Icinga link to HTTPS on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496731 (owner: 10Dzahn) [09:02:26] (03PS2) 10Dzahn: cache::ssl::unified: Icinga link to HTTPS on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496731 [09:05:07] (03PS1) 10Dzahn: apertium: Icinga link to CX page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496732 [09:11:25] (03PS1) 10Elukey: Add mapreduce.ssl.enabled setting to the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496733 (https://phabricator.wikimedia.org/T217412) [09:12:42] (03CR) 10Elukey: [C: 03+2] Add mapreduce.ssl.enabled setting to the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496733 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [09:13:14] (03PS26) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [09:14:45] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [09:15:47] (03PS27) 10Jcrespo: mariadb-backup: Deploy new 
snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [09:16:43] (03PS1) 10Dzahn: base::monitoring: link to new runbook page for SSH checks [puppet] - 10https://gerrit.wikimedia.org/r/496735 [09:16:45] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [09:17:55] !log Ramp up cxserver k8s traffic to 50% - T213195 [09:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:58] T213195: Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 [09:18:41] !log jiji@cumin1001 conftool action : set/weight=15; selector: dc=codfw,service=cxserver,cluster=scb,name=kubernetes.* [09:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:05] !log jiji@cumin1001 conftool action : set/weight=12; selector: dc=eqiad,service=cxserver,cluster=scb,name=kubernetes.* [09:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:57] (03PS1) 10Alexandros Kosiaris: zotero: Remove prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/496736 [09:19:59] (03PS1) 10Alexandros Kosiaris: Package zotero 0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/496737 [09:20:01] (03PS1) 10Alexandros Kosiaris: citoid: Add a 10s histogram bucket for requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/496738 [09:20:03] (03PS1) 10Alexandros Kosiaris: Package citoid 0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/496739 [09:20:05] (03PS1) 10Alexandros Kosiaris: citoid: Add a 10s histogram bucket for requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/496740 [09:20:07] (03PS1) 10Alexandros Kosiaris: Package cxserver 0.0.4 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/496741 [09:21:13] (03PS28) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [09:21:31] (03PS2) 10Alexandros Kosiaris: cxserver: Add a 10s histogram bucket for requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/496740 [09:21:33] (03PS2) 10Alexandros Kosiaris: Package cxserver 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/496741 [09:22:09] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [09:23:53] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled [09:24:05] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) [09:24:09] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) [09:24:21] PROBLEM - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:24:44] uh oh [09:24:46] anybody working on it? [09:24:48] checking [09:24:51] ack :) [09:25:02] godog: we just sent more traffic to k8s [09:25:08] cxserver [09:25:20] I am not sure it is related [09:25:24] jijiki: thanks, might be! 
[09:25:53] (03PS29) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [09:25:53] (03PS3) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [09:26:12] godog: should I depool codfw to check if it recovers? [09:26:36] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [09:26:36] (03PS4) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [09:26:39] (03PS4) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 [09:26:50] grafana is unresponsive too (even non-prometheus functionality) [09:27:27] now it is loading, very slowly [09:27:39] it works for me quite normally [09:27:39] RECOVERY - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10959 bytes in 6.335 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:27:46] !log temporarily disable read queries to prometheus@k8s on prometheus2003 [09:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:06] !log correction, prometheus2004 [09:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:20] we had a similar issue last week iirc? 
[09:28:24] we did yeah [09:28:25] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [09:28:55] looks like the same issue too, queries timing out [09:28:59] (03CR) 10Jcrespo: "Making python3 scripts python2-compatible so that CI passes *facepalms*" [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [09:29:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496742 [09:29:11] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [09:29:15] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [09:29:21] it was codfw as well back then, right? [09:29:47] but eqiad is not suffering because it's 2 backend servers? [09:29:58] godog: last week we had ramped up traffic to k8s citoid [09:30:35] akosiaris: I think that's right yeah [09:30:37] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 (owner: 10Gehel) [09:30:40] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496742 (owner: 10Marostegui) [09:31:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496742 (owner: 10Marostegui) [09:32:09] so yeah doesn't look like the fixes we did last week were fully effective [09:32:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103 (duration: 00m 50s) [09:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:24] anyways I've disabled the queries for now, data is still being collected though [09:33:44] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496742 
(owner: 10Marostegui) [09:33:50] (03PS1) 10Elukey: Enable TLS configuration for mapreduce on Hadoop Test masters [puppet] - 10https://gerrit.wikimedia.org/r/496743 (https://phabricator.wikimedia.org/T217412) [09:34:34] (03CR) 10Elukey: [C: 03+2] Enable TLS configuration for mapreduce on Hadoop Test masters [puppet] - 10https://gerrit.wikimedia.org/r/496743 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [09:34:41] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10fgiunchedi) 05Resolved→03Open reopening as this just reoccurred [09:34:42] (03PS5) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 [09:36:41] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10jcrespo) may I ask for the actual (emergency) actions you took, in case it happens again and you are not... 
[09:38:59] PROBLEM - grafana.wikimedia.org on grafana1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:35] (03PS3) 10Gehel: icinga: modify cirrus prometheus checks threshold [puppet] - 10https://gerrit.wikimedia.org/r/496509 (owner: 10Mathew.onipe) [09:41:40] ugh that looks like it is true [09:41:49] I'll bounce grafana [09:42:53] !log bounce grafana-server on grafana1001 [09:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:47] RECOVERY - grafana.wikimedia.org on grafana1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49665 bytes in 0.159 second response time [09:44:08] (03CR) 10Gehel: [C: 03+2] icinga: modify cirrus prometheus checks threshold [puppet] - 10https://gerrit.wikimedia.org/r/496509 (owner: 10Mathew.onipe) [09:45:15] nothing unknown unfortunately, I think at least two urgent things to do: better protection on the apache side so that slow queries don't exhaust its workers, and moving k8s dashboards to use the recording rules [09:45:17] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "tested locally, worked as expected" [deployment-charts] - 10https://gerrit.wikimedia.org/r/496736 (owner: 10Alexandros Kosiaris) [09:45:22] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Package zotero 0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/496737 (owner: 10Alexandros Kosiaris) [09:45:34] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] citoid: Add a 10s histogram bucket for requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/496738 (owner: 10Alexandros Kosiaris) [09:45:39] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Package citoid 0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/496739 (owner: 10Alexandros Kosiaris) [09:45:44] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] cxserver: Add a 10s histogram bucket for requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/496740 (owner: 10Alexandros Kosiaris) [09:45:48] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] 
Package cxserver 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/496741 (owner: 10Alexandros Kosiaris) [09:46:02] (03PS6) 10Muehlenhoff: Clarify expected format of service name in wmf-auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) [09:46:43] godog: I can handle the latter [09:47:29] akosiaris: thanks! appreciate it, beware though that ATM there won't be a whole lot of history on those recording rules [09:47:33] I'll investigate the former [09:48:18] there isn't a whole lot of history in the metrics to begin with, so it's fine [09:48:50] (03PS2) 10Dzahn: apertium: Icinga link to CX page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496732 [09:49:28] (03CR) 10Dzahn: [C: 03+2] apertium: Icinga link to CX page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496732 (owner: 10Dzahn) [09:49:37] ack [09:50:52] FWIW eqiad had the same cpu spike with the influx on new metrics, but apache didn't run out of workers [09:50:54] (03CR) 10Vgutierrez: [C: 03+1] cache::ssl::unified: Icinga link to HTTPS on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496731 (owner: 10Dzahn) [09:51:45] (03PS3) 10Dzahn: cache::ssl::unified: Icinga link to HTTPS on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496731 [09:52:04] (03CR) 10Dzahn: [C: 03+2] cache::ssl::unified: Icinga link to HTTPS on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/496731 (owner: 10Dzahn) [09:52:58] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-codfw.yaml production stable/zotero [namespace: zotero, clusters: codfw] [09:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:03] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [09:53:03] !log akosiaris@deploy1001 scap-helm zotero finished [09:53:04] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-eqiad.yaml production stable/zotero [namespace: zotero, clusters: eqiad] 
[09:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:09] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [09:53:09] !log akosiaris@deploy1001 scap-helm zotero finished [09:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:32] (03PS7) 10Muehlenhoff: Clarify expected format of service name in wmf-auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) [09:55:38] (03CR) 10Muehlenhoff: [C: 03+2] Clarify expected format of service name in wmf-auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) (owner: 10Muehlenhoff) [09:56:35] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute daniel_zahn in downtime https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:58:14] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-staging.yaml staging stable/zotero [namespace: zotero, clusters: staging] [09:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:45] RECOVERY - Mjolnir bulk update failure check - codfw on icinga2001 is OK: (C)2 gt (W)1 gt 0 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [09:58:45] RECOVERY - Mjolnir bulk update failure check - eqiad on icinga2001 is OK: (C)2 gt (W)1 gt 0 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [10:00:08] !log akosiaris@deploy1001 scap-helm zotero 
upgrade --install -f zotero-values-staging.yaml staging stable/zotero [namespace: zotero, clusters: staging] [10:00:09] !log akosiaris@deploy1001 scap-helm zotero cluster staging completed [10:00:09] !log akosiaris@deploy1001 scap-helm zotero finished [10:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:34] 10Operations, 10monitoring: HP RAID (Service Check Timed Out) - https://phabricator.wikimedia.org/T172708 (10Dzahn) a:05ayounsi→03None [10:01:25] RECOVERY - Cirrus Update lag check - codfw on icinga2001 is OK: (C)2 gt (W)1 gt 0 https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-7d&to=now [10:01:25] RECOVERY - Cirrus Update lag check - eqiad on icinga2001 is OK: (C)2 gt (W)1 gt 0 https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-7d&to=now [10:01:28] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-values-staging.yaml staging stable/citoid [namespace: citoid, clusters: staging] [10:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:34] 10Operations, 10monitoring: HP RAID (Service Check Timed Out) - https://phabricator.wikimedia.org/T172708 (10Dzahn) HP RAID checks are timing out on all eqiad swift hosts. 
[10:02:03] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging] [10:02:04] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed [10:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:04] !log akosiaris@deploy1001 scap-helm citoid finished [10:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:06] 10Operations, 10media-storage, 10monitoring: HP RAID (Service Check Timed Out) on swift hosts - https://phabricator.wikimedia.org/T172708 (10Dzahn) [10:02:32] !log remove prometheus-statsd-exporter from zotero pods [10:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:46] !log add a 10s bucket to citoid prometheus-statsd exporter mappings [10:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:58] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-eqiad-values.yaml production stable/citoid [namespace: citoid, clusters: eqiad] [10:02:59] !log akosiaris@deploy1001 scap-helm citoid cluster eqiad completed [10:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:59] !log akosiaris@deploy1001 scap-helm citoid finished [10:02:59] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-codfw-values.yaml production stable/citoid [namespace: citoid, clusters: codfw] [10:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:01] !log akosiaris@deploy1001 scap-helm citoid cluster codfw completed [10:03:01] !log akosiaris@deploy1001 scap-helm citoid finished [10:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:07] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:06:16] (03PS1) 10Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) [10:07:16] (03CR) 10DCausse: [C: 03+1] Update apifeatureusage es template to match 5.6.x+ [puppet] - 10https://gerrit.wikimedia.org/r/496557 (https://phabricator.wikimedia.org/T183156) (owner: 10EBernhardson) [10:07:25] (03CR) 10jerkins-bot: [V: 04-1] mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [10:07:41] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:07:52] 10Operations, 10Patch-For-Review: wmf-auto-restart fails on certain legacy services - https://phabricator.wikimedia.org/T212219 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Fix has been deployed, [10:15:19] (03PS2) 10Gehel: Update apifeatureusage es template to match 5.6.x+ [puppet] - 10https://gerrit.wikimedia.org/r/496557 (https://phabricator.wikimedia.org/T183156) (owner: 10EBernhardson) [10:16:33] (03PS1) 10Elukey: Add a core-site.xml global property to the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496747 (https://phabricator.wikimedia.org/T217412) [10:16:35] (03CR) 10Gehel: [C: 03+2] "Template already updated in the cluster." 
[puppet] - 10https://gerrit.wikimedia.org/r/496557 (https://phabricator.wikimedia.org/T183156) (owner: 10EBernhardson) [10:18:09] (03CR) 10Elukey: [C: 03+2] Add a core-site.xml global property to the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496747 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [10:18:17] (03PS2) 10Elukey: Add a core-site.xml global property to the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496747 (https://phabricator.wikimedia.org/T217412) [10:18:20] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add a core-site.xml global property to the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496747 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [10:30:03] akosiaris jijiki I've silenced prometheus.svc.codfw temporarily and will reenable k8s queries, what was the dashboard(s) you were looking at so I can try again? [10:30:04] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/citoid [namespace: cxserver, clusters: staging] [10:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:56] godog: https://grafana.wikimedia.org/d/F7rttgqmz/cxserver [10:31:08] last week was citoid iirc [10:31:11] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [10:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:12] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed [10:31:12] !log akosiaris@deploy1001 scap-helm cxserver finished [10:31:13] which has a similar dashboad [10:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:34] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver 
[namespace: cxserver, clusters: codfw] [10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:36] !log akosiaris@deploy1001 scap-helm cxserver cluster codfw completed [10:31:36] !log akosiaris@deploy1001 scap-helm cxserver finished [10:31:36] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [10:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:37] !log akosiaris@deploy1001 scap-helm cxserver cluster eqiad completed [10:31:37] !log akosiaris@deploy1001 scap-helm cxserver finished [10:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:42] !log add a 10s bucket to cxserver prometheus-statsd exporter mappings [10:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:44] jijiki: ack, thanks! 
[10:33:39] the more I think about it the more I'm convinced it is expensive queries + metrics churn due to rolling restart of pods [10:37:55] it looks like wikibugs went AWOL 12 minutes ago [10:40:45] PROBLEM - grafana.wikimedia.org on grafana1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:41:05] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) [10:41:09] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled [10:41:27] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled [10:41:42] yes yes [10:41:49] RECOVERY - grafana.wikimedia.org on grafana1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49666 bytes in 0.172 second response time [10:41:49] :) [10:42:23] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [10:42:39] godog: how can we fix it? [10:42:43] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [10:43:56] jijiki: one of the mitigations I'm currently looking into is limiting connections that apache is willing to make a single k8s instance [10:44:09] sorry, prometheus instance not k8s instance [10:44:28] so that an overloaded prometheus instance doesn't bring the whole thing down [10:44:43] the whole thing == apache on prometheus hosts [10:45:44] tx :) [10:46:06] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [11:05:33] @seen wikibugs [11:05:33] mutante: Last time I saw wikibugs they were quitting the network with reason: Ping timeout: 272 seconds N/A at 3/15/2019 10:25:01 AM (40m32s ago) [11:07:01] dzahn@tools-sgebastion-07:~$ become wikibugs [11:07:01] You are not a member of the group tools.wikibugs. 
[11:07:16] legoktm: ^ could you take a look at wikibugs? [11:16:48] !log reenable prometheus@k8s on prometheus2004 with mod_proxy connection limits - T217715 [11:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:52] T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 [11:22:31] (03PS1) 10Vgutierrez: Release 0.11 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/496753 (https://phabricator.wikimedia.org/T207295) [11:27:19] (03CR) 10Volans: [C: 03+1] "LGTM, two caveats inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [11:28:12] (03CR) 10Vgutierrez: [C: 03+2] Release 0.11 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/496753 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:30:24] (03Merged) 10jenkins-bot: Release 0.11 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/496753 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:31:55] (03CR) 10jenkins-bot: Release 0.11 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/496753 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:33:21] (03PS1) 10Vgutierrez: acme-chief: Store certificates in unique directories [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496754 (https://phabricator.wikimedia.org/T207295) [11:33:26] (03PS1) 10Vgutierrez: Release 0.11 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496755 (https://phabricator.wikimedia.org/T207295) [11:38:50] (03PS1) 10Vgutierrez: debian: Replace live_certs and new_certs with certs [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496756 (https://phabricator.wikimedia.org/T207295) [11:38:52] (03PS1) 10Vgutierrez: debian: Add release 0.11 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496757 (https://phabricator.wikimedia.org/T207295) 
[11:42:19] (03PS2) 10Vgutierrez: debian: Replace live_certs and new_certs with certs [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496756 (https://phabricator.wikimedia.org/T207295) [11:42:21] (03PS2) 10Vgutierrez: debian: Add release 0.11 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496757 (https://phabricator.wikimedia.org/T207295) [11:55:43] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Store certificates in unique directories [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496754 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:55:48] (03CR) 10Vgutierrez: [C: 03+2] Release 0.11 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496755 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:56:45] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.11 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496757 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:57:21] (03Merged) 10jenkins-bot: acme-chief: Store certificates in unique directories [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496754 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:57:24] (03Merged) 10jenkins-bot: Release 0.11 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496755 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:58:36] (03CR) 10Volans: [C: 04-1] "Type mismatch, some additional comments inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [11:58:51] (03CR) 10jenkins-bot: acme-chief: Store certificates in unique directories [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496754 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:58:58] (03CR) 10jenkins-bot: Release 0.11 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496755 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:00:14] 10Operations, 10monitoring, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10Dzahn) {F28392619} [12:04:10] (03CR) 10Alex Monk: [C: 03+2] debian: Replace live_certs and new_certs with certs [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496756 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:05:45] (03Merged) 10jenkins-bot: debian: Replace live_certs and new_certs with certs [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496756 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:05:48] (03Merged) 10jenkins-bot: debian: Add release 0.11 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496757 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:07:18] (03CR) 10jenkins-bot: debian: Replace live_certs and new_certs with certs [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496756 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:07:28] (03CR) 10jenkins-bot: debian: Add release 0.11 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/496757 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:21:08] (03PS4) 10Muehlenhoff: Initial Kerberos KDC/kadminserver profiles/roles (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/494242 [12:22:08] 
(03CR) 10jerkins-bot: [V: 04-1] Initial Kerberos KDC/kadminserver profiles/roles (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/494242 (owner: 10Muehlenhoff) [12:24:14] (03CR) 10Volans: [C: 04-1] "A bunch of thing to improve/fix, see inline." (0315 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [12:24:29] 10Operations, 10SRE-Access-Requests: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10Dzahn) I agree it's almost like an oversight. perf-roots have root on appservers and mwmaint should count as a special kind of appserver imho. +1 [12:25:38] (03PS5) 10Muehlenhoff: Initial Kerberos KDC/kadminserver profiles/roles (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/494242 [12:26:46] (03PS1) 10Dzahn: admins: add perf-roots on mediawiki-maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/496761 (https://phabricator.wikimedia.org/T217813) [12:27:59] (03PS2) 10Dzahn: admins: add perf-roots on mediawiki-maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/496761 (https://phabricator.wikimedia.org/T217813) [12:30:29] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "lgtm. email and UID confirmed in LDAP" [puppet] - 10https://gerrit.wikimedia.org/r/496379 (https://phabricator.wikimedia.org/T217438) (owner: 10Vgutierrez) [12:44:38] mutante: I've worked around not being a member of tools.wikibugs with sudo -i, just updated https://www.mediawiki.org/wiki/Wikibugs#Deploying_changes to reflect that [12:45:14] ema: oh! good to know, thanks! [12:45:26] mutante: however I still haven't figured out how to properly restart that thing. `python3 manage.py restart_job` failed saying that tools-sgebastion-07.tools.eqiad.wmflabs is not and admin host [12:45:38] s/and admin/an admin/ [12:46:57] ema: hmm. 
i also don't know how to identify admin hosts [12:47:13] and I don't know what an admin host is, nor did I find anything on wikitech explaining that. If someone does know, please update https://www.mediawiki.org/wiki/Wikibugs so we can finally restart it [12:48:09] well ideally fix it so that it doesn't die, if that is not possible autorestart it when it commits seppuku? [13:00:15] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/496764 (https://phabricator.wikimedia.org/T135991) [13:00:34] (03CR) 10Dzahn: [C: 03+2] "not used - https://puppet-compiler.wmflabs.org/compiler1002/15143/" [puppet] - 10https://gerrit.wikimedia.org/r/496489 (owner: 10Dzahn) [13:04:08] (03PS2) 10Dzahn: rm mediawiki::generic_monitoring (Apple bridge) [puppet] - 10https://gerrit.wikimedia.org/r/496489 [13:16:04] (03PS4) 10GTirloni: puppet_alert: Email projectadmins instead of members [puppet] - 10https://gerrit.wikimedia.org/r/495757 (https://phabricator.wikimedia.org/T218009) (owner: 10Alex Monk) [13:17:12] (03CR) 10GTirloni: [C: 03+2] puppet_alert: Email projectadmins instead of members [puppet] - 10https://gerrit.wikimedia.org/r/495757 (https://phabricator.wikimedia.org/T218009) (owner: 10Alex Monk) [13:25:05] (03PS6) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 [13:26:57] (03PS1) 10Ema: ATS: set disk cache size cutoff to 1G [puppet] - 10https://gerrit.wikimedia.org/r/496767 (https://phabricator.wikimedia.org/T213263) [13:29:19] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 (owner: 10Gehel) [13:30:49] 10Puppet, 10cloud-services-team, 10Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (10GTirloni) Change is deployed, thanks a lot @Krenair 
[13:30:59] 10Puppet, 10cloud-services-team, 10Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (10GTirloni) 05Open→03Resolved p:05Triage→03Normal [13:32:52] (03PS2) 10Ema: ATS: set disk cache size cutoff to 1G [puppet] - 10https://gerrit.wikimedia.org/r/496767 (https://phabricator.wikimedia.org/T213263) [13:33:38] (03CR) 10Ema: [C: 03+2] ATS: set disk cache size cutoff to 1G [puppet] - 10https://gerrit.wikimedia.org/r/496767 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [13:35:09] (03PS2) 10GTirloni: toolforge: Cleanup host_aliases and exim4 conf for Trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/496680 (https://phabricator.wikimedia.org/T109485) (owner: 10BryanDavis) [13:36:53] (03PS3) 10Dzahn: rm mediawiki::generic_monitoring (Apple bridge) [puppet] - 10https://gerrit.wikimedia.org/r/496489 [13:37:26] (03CR) 10GTirloni: "Could you remove it from ./modules/profile/templates/toolforge/mail-relay.exim4.conf.erb as well?" [puppet] - 10https://gerrit.wikimedia.org/r/496680 (https://phabricator.wikimedia.org/T109485) (owner: 10BryanDavis) [13:38:51] (03PS4) 10GTirloni: Add rate limiting to toollabs::mailrelay with warn action [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [13:40:39] (03CR) 10GTirloni: "Could you also modify the files under modules/profile/manifests/toolforge? 
We're caught maintaining both places until toollabs/trusty gets" [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [13:41:53] (03PS2) 10GTirloni: Use tesseract-ocr-all instead of a list of most Tesseract packages [puppet] - 10https://gerrit.wikimedia.org/r/496008 (https://phabricator.wikimedia.org/T218151) (owner: 10Tpt) [13:42:58] (03CR) 10GTirloni: [C: 03+2] Use tesseract-ocr-all instead of a list of most Tesseract packages [puppet] - 10https://gerrit.wikimedia.org/r/496008 (https://phabricator.wikimedia.org/T218151) (owner: 10Tpt) [13:44:32] (03PS4) 10GTirloni: openstack: Automatically start/stop VMs on hypervisor boot/shutdown [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) [13:45:21] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable base::service_auto_restart for prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/496764 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:46:54] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:47:20] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [13:47:21] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed [13:47:21] !log akosiaris@deploy1001 scap-helm cxserver finished [13:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:56] (03PS2) 10Filippo Giunchedi: prometheus: maximum connections to proxypass [puppet] - 10https://gerrit.wikimedia.org/r/496750 (https://phabricator.wikimedia.org/T217715) [13:50:50] !log rolling reboot of ores in codfw for SSBD/L1TF kernel 
update [13:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:42] (03CR) 10GTirloni: [C: 03+2] openstack: Automatically start/stop VMs on hypervisor boot/shutdown [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) (owner: 10GTirloni) [13:51:48] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:53:05] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10fgiunchedi) >>! In T217715#5026628, @jcrespo wrote: > may I ask for the actual (emergency) actions you to... [13:58:44] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:59:02] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:59:40] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:00:18] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:04] PROBLEM - puppet last run on cloudvirt1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:20] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:02:44] PROBLEM - puppet last run on cloudvirt1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:54] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:02:54] PROBLEM - puppet last run on cloudvirtan1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:54] PROBLEM - puppet last run on cloudvirtan1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:03:15] hey akosiaris, is verb=CONNECT latency meaningful to monitor? [14:03:20] gtirloni: maybe your change? ^ (puppet failures) [14:03:29] yeah :| [14:03:31] checking that [14:04:38] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:05:46] cdanis: good q. I think it's only used for the apiserver to kubelet proxying [14:05:53] logs, exec and so on [14:06:05] that was my guess but I hadn't actually been able to prove that [14:06:18] it's probably nice to have graphs of it, but not alert on it? [14:06:45] SGTM, especially given that 'kubectl logs -f' exists [14:08:03] (03PS1) 10Elukey: Add core-site.xml TLS properties to Hadoop Test Analytics [puppet] - 10https://gerrit.wikimedia.org/r/496775 (https://phabricator.wikimedia.org/T217412) [14:08:30] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:08:41] it alerted a few times earlier this week and I think that the only thing that was actually happening was... someone was using kubectl to debug stuff [14:09:02] that sounds plausible [14:09:33] there is one more case where these might alert and it's when etcd latencies increase as well [14:09:40] but we have an alert for those specifically [14:09:44] (03PS1) 10Jbond: Create and mtail parser for ulogd and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) [14:09:46] it also happened only once up to now [14:09:57] some network issues between the etcds at that point. 
A rack move IIRC [14:10:10] A machined moved between racks/switches, that is [14:10:21] (03CR) 10Elukey: [C: 03+2] Add core-site.xml TLS properties to Hadoop Test Analytics [puppet] - 10https://gerrit.wikimedia.org/r/496775 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [14:10:44] PROBLEM - puppet last run on cloudvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:11:38] PROBLEM - puppet last run on cloudvirtan1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:11:48] PROBLEM - Disk space on prometheus2003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/services 3255 MB (3% inode=99%) [14:11:48] PROBLEM - puppet last run on labtestvirt2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:11:58] (03CR) 10Paladox: [V: 03+2 C: 03+2] Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/496441 (owner: 10Paladox) [14:12:09] (03PS1) 10GTirloni: openstack: Only perform VM startup/shutdown on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/496778 (https://phabricator.wikimedia.org/T216040) [14:13:05] (03PS2) 10GTirloni: openstack: Only perform VM startup/shutdown on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/496778 (https://phabricator.wikimedia.org/T216040) [14:13:07] prometheus2003 is me [14:13:41] (03CR) 10jerkins-bot: [V: 04-1] openstack: Only perform VM startup/shutdown on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/496778 (https://phabricator.wikimedia.org/T216040) (owner: 10GTirloni) [14:14:01] 10Operations, 10Cloud-VPS, 10Traffic, 10LDAP, 10cloud-services-team (Kanban): Update openldap profile to use LE - https://phabricator.wikimedia.org/T218398 (10Dzahn) [14:14:16] RECOVERY - Disk space on prometheus2003 is OK: DISK OK [14:14:53] (03PS3) 10GTirloni: openstack: Only perform VM startup/shutdown on 
Stretch [puppet] - 10https://gerrit.wikimedia.org/r/496778 (https://phabricator.wikimedia.org/T216040) [14:14:54] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:15:22] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:15:37] (03CR) 10GTirloni: [C: 03+2] openstack: Only perform VM startup/shutdown on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/496778 (https://phabricator.wikimedia.org/T216040) (owner: 10GTirloni) [14:15:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, bummer we're restarting and not reload but Good Enough™ for now" [puppet] - 10https://gerrit.wikimedia.org/r/496625 (owner: 10Herron) [14:15:48] (03PS4) 10GTirloni: openstack: Only perform VM startup/shutdown on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/496778 (https://phabricator.wikimedia.org/T216040) [14:15:48] 10Operations, 10Cloud-VPS, 10Traffic, 10LDAP, 10cloud-services-team (Kanban): Update openldap profile to use LE - https://phabricator.wikimedia.org/T218398 (10Dzahn) per brief IRC chat with @vgutierrez it should be possible to migrate these to use Certcentral (https://wikitech.wikimedia.org/wiki/Certcent... [14:16:24] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:16:43] (03PS1) 10Ema: Initial debianization [debs/superior-cache-analyzer] - 10https://gerrit.wikimedia.org/r/496781 [14:17:14] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:18:28] (03PS2) 10Herron: rsyslog: restart service on kafka_shipper lookup table change [puppet] - 10https://gerrit.wikimedia.org/r/496625 [14:19:00] PROBLEM - puppet last run on cloudvirtan1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:19:42] 10Operations, 10Cloud-VPS, 10Traffic, 10LDAP, 10cloud-services-team (Kanban): Update openldap profile to use LE - https://phabricator.wikimedia.org/T218398 (10Dzahn) command for testing connection over ldaps with more debug info why it fails: [ldap-eqiad-replica01:/etc/ldap] $ `ldapsearch -H ldaps://lda... [14:20:14] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:20:19] (03PS1) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [14:20:32] PROBLEM - puppet last run on cloudvirt1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:20:34] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:54] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [14:21:04] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:21:34] PROBLEM - Check systemd state on cloudvirt1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:21:36] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:21:44] (03PS4) 10Dzahn: rm mediawiki::generic_monitoring (Apple bridge) [puppet] - 10https://gerrit.wikimedia.org/r/496489 [14:22:28] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:22:42] PROBLEM - puppet last run on cloudvirt1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:22:50] RECOVERY - Check systemd state on cloudvirt1013 is OK: OK - running: The system is fully operational [14:23:13] (03PS1) 10Elukey: Fix TLS parameter in Hadoop Test Analytics config [puppet] - 10https://gerrit.wikimedia.org/r/496784 (https://phabricator.wikimedia.org/T217412) [14:23:17] (03PS2) 10Ema: Initial debianization [debs/superior-cache-analyzer] - 10https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) [14:23:24] (03CR) 10Herron: [C: 03+2] rsyslog: restart service on kafka_shipper lookup table change [puppet] - 10https://gerrit.wikimedia.org/r/496625 (owner: 10Herron) [14:23:32] RECOVERY - puppet last run on cloudvirt1022 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:23:34] PROBLEM - puppet last run on cloudvirt1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:23:46] RECOVERY - puppet last run on cloudvirtan1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:23:46] RECOVERY - puppet last run on cloudvirtan1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:23:58] (03CR) 10Elukey: [C: 03+2] Fix TLS parameter in Hadoop Test Analytics config [puppet] - 10https://gerrit.wikimedia.org/r/496784 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [14:24:07] (03PS2) 10Elukey: Fix TLS parameter in Hadoop Test Analytics config [puppet] - 10https://gerrit.wikimedia.org/r/496784 (https://phabricator.wikimedia.org/T217412) [14:24:11] (03PS2) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [14:24:14] RECOVERY - puppet last run on cloudvirtan1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:24:33] (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix TLS parameter in Hadoop Test 
Analytics config [puppet] - 10https://gerrit.wikimedia.org/r/496784 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [14:25:10] (03PS5) 10Dzahn: rm mediawiki::generic_monitoring (Apple bridge) [puppet] - 10https://gerrit.wikimedia.org/r/496489 [14:25:42] RECOVERY - puppet last run on labtestmetal2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:25:44] RECOVERY - puppet last run on cloudvirt1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:22] RECOVERY - puppet last run on cloudvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:26:53] (03PS1) 10Andrew Bogott: acme_chief: generate certs for ldap-labs/ldap-ro servers [puppet] - 10https://gerrit.wikimedia.org/r/496785 (https://phabricator.wikimedia.org/T218398) [14:27:04] RECOVERY - puppet last run on cloudvirt1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:27:16] RECOVERY - puppet last run on cloudvirtan1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:47] (03PS4) 10Dzahn: service: add Icinga notes URL in defined types [puppet] - 10https://gerrit.wikimedia.org/r/496435 (https://phabricator.wikimedia.org/T197873) [14:27:54] RECOVERY - puppet last run on cloudvirt1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:28:48] RECOVERY - puppet last run on cloudvirt1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:29:00] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:29:40] (03PS3) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [14:31:44] !log tools-sgebastion-07 - generating locales for user request in T130532 [14:31:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:47] T130532: Offer Korean Locales "ko_KR.euckr" and "ko_KR.utf8" on Tool Labs - https://phabricator.wikimedia.org/T130532 [14:33:25] (03PS2) 10Mholloway: Admin: Update production SSH key for Michael Holloway [puppet] - 10https://gerrit.wikimedia.org/r/496706 [14:35:04] (03CR) 10Mholloway: "Done, thanks Vgutierrez!" [puppet] - 10https://gerrit.wikimedia.org/r/496706 (owner: 10Mholloway) [14:37:00] (03PS2) 10Andrew Bogott: acme_chief: generate certs for ldap-labs/ldap-ro servers [puppet] - 10https://gerrit.wikimedia.org/r/496785 (https://phabricator.wikimedia.org/T218398) [14:39:36] (03CR) 10Vgutierrez: [C: 03+1] Admin: Update production SSH key for Michael Holloway [puppet] - 10https://gerrit.wikimedia.org/r/496706 (owner: 10Mholloway) [14:39:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 comments, rest LGTM" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/496007 (https://phabricator.wikimedia.org/T218133) (owner: 10Andrew Bogott) [14:41:22] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/496785 (https://phabricator.wikimedia.org/T218398) (owner: 10Andrew Bogott) [14:42:30] (03PS3) 10Andrew Bogott: acme_chief: generate certs for ldap-labs/ldap-ro servers [puppet] - 10https://gerrit.wikimedia.org/r/496785 (https://phabricator.wikimedia.org/T218398) [14:42:49] (03PS4) 10Andrew Bogott: acme_chief: generate certs for ldap-labs/ldap-ro servers [puppet] - 10https://gerrit.wikimedia.org/r/496785 (https://phabricator.wikimedia.org/T218398) [14:42:51] (03PS1) 10Andrew Bogott: openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 [14:43:06] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:43:39] !log rebooting etherpad1001 to pick up SSBD-enabled qemu [14:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:48] (03CR) 10Andrew 
Bogott: [C: 03+2] acme_chief: generate certs for ldap-labs/ldap-ro servers [puppet] - 10https://gerrit.wikimedia.org/r/496785 (https://phabricator.wikimedia.org/T218398) (owner: 10Andrew Bogott) [14:45:06] (03PS2) 10Andrew Bogott: openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 [14:45:19] !log tools tools-sgebastion-07 - dpkg-reconfigure locales and adding ko_KR.EUC-KR for Korean users by request and as done in the past on former tools bastion [14:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:42] (03PS7) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 [14:47:24] (03PS3) 10Dzahn: Admin: Update production SSH key for Michael Holloway [puppet] - 10https://gerrit.wikimedia.org/r/496706 (owner: 10Mholloway) [14:48:00] oh man etherpad is really spammy isn't it, just saw an influx of logs (cc moritzm ) [14:49:08] only for the reboot period (a minute or so), or also after that? [14:49:49] (03CR) 10Dzahn: [C: 03+2] Admin: Update production SSH key for Michael Holloway [puppet] - 10https://gerrit.wikimedia.org/r/496706 (owner: 10Mholloway) [14:50:24] (03CR) 10Dzahn: [C: 03+2] "confirmed key matches the pastebin from phab user linked to mw user with wmf name etc" [puppet] - 10https://gerrit.wikimedia.org/r/496706 (owner: 10Mholloway) [14:51:13] (03PS3) 10Andrew Bogott: openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 [14:53:13] (03PS4) 10Andrew Bogott: openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 [14:54:13] PROBLEM - puppet last run on graphite2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[openssh-client],Package[openssh-server],Package[nagios-nrpe-server] [14:54:58] (03PS5) 10Andrew Bogott: openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 [14:56:17] 10Operations, 10monitoring, 10Patch-For-Review: link Icinga checks to runbook / notes URLs - https://phabricator.wikimedia.org/T197873 (10Dzahn) A lot of changes have been merged that i did not individually link to this ticket to avoid spamming people. There is a common topic branch, notes-urls, to list th... [15:00:18] (03PS6) 10Andrew Bogott: openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 [15:01:19] (03PS7) 10Andrew Bogott: openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 (https://phabricator.wikimedia.org/T218398) [15:02:17] (03PS4) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:02:19] (03CR) 10Andrew Bogott: [C: 03+2] openldap: switch to using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/496790 (https://phabricator.wikimedia.org/T218398) (owner: 10Andrew Bogott) [15:04:35] !log cp2015: test ATS depool T213263 [15:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:39] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [15:05:22] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,cluster=cache_upload,name=cp2015.codfw.wmnet [15:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:39] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 1 minute ago with 8 failures. 
Failed resources (up to 3 shown): File[/etc/acmecerts/ldap.rsa-2048.crt],File[/etc/acmecerts/ldap.rsa-2048.chain.crt],File[/etc/acmecerts/ldap.rsa-2048.chained.crt],File[/etc/acmecerts/ldap.rsa-2048.key] [15:05:54] (03PS1) 10Andrew Bogott: ldap: give more systems access to the ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/496799 (https://phabricator.wikimedia.org/T218398) [15:06:05] (03PS1) 10GTirloni: ldap: Increase nscd cache size [puppet] - 10https://gerrit.wikimedia.org/r/496800 (https://phabricator.wikimedia.org/T217280) [15:06:34] (03CR) 10Andrew Bogott: [C: 03+2] ldap: give more systems access to the ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/496799 (https://phabricator.wikimedia.org/T218398) (owner: 10Andrew Bogott) [15:07:44] (03CR) 10Andrew Bogott: [C: 03+1] "This is something I'd thought to try but hadn't gotten around to. Cache hits are super low so this may well help." [puppet] - 10https://gerrit.wikimedia.org/r/496800 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [15:07:45] !log rebooting graphite2003 for kernel security update [15:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:30] (03CR) 10Muehlenhoff: "Is the group size really big enough? 
We need it to be a prime number and bigger than the accumulated number of all group memberships (and " [puppet] - 10https://gerrit.wikimedia.org/r/496800 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [15:10:36] !log cp2015: repool ATS with proxy.config.cache.ram_cache.size 1G T213263 [15:10:38] (03CR) 10Gehel: "> Patch Set 2:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [15:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:40] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [15:11:11] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend [15:11:13] PROBLEM - Labs LDAP on serpens is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [15:11:30] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,cluster=cache_upload,name=cp2015.codfw.wmnet [15:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:49] PROBLEM - Check systemd state on serpens is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:11:49] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:12:25] PROBLEM - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [15:12:42] andrewbogott: ^ [15:12:45] that paged =] [15:13:00] yep, that's me, sorry, will silence [15:13:01] received the page as well, here if you need a hand [15:13:21] uh?
[15:13:24] (03PS5) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:13:37] yea, so that would be the LDAP server cert change [15:14:07] (03PS6) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:14:13] ACKNOWLEDGEMENT - Check systemd state on serpens is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott I'm trying to update these certs, running into wrinkles [15:14:13] ACKNOWLEDGEMENT - Labs LDAP on serpens is CRITICAL: Could not bind to the LDAP server andrew bogott I'm trying to update these certs, running into wrinkles https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [15:14:13] ACKNOWLEDGEMENT - puppet last run on serpens is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[slapd] andrew bogott I'm trying to update these certs, running into wrinkles [15:14:26] looks at the systemd issue on acmechief first [15:14:43] RECOVERY - puppet last run on graphite2003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:15:35] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational [15:16:55] PROBLEM - Labs LDAP on labtestservices2001 is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [15:17:00] (03CR) 10Gehel: [C: 03+2] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:17:04] (03PS20) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:18:37] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units
failed. [15:20:28] (03PS7) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:21:16] ^ debug in -traffic [15:22:29] (03PS1) 10Andrew Bogott: acme ldap-labtest cert: terminate a regexp [puppet] - 10https://gerrit.wikimedia.org/r/496802 [15:23:07] (03CR) 10Andrew Bogott: [C: 03+2] acme ldap-labtest cert: terminate a regexp [puppet] - 10https://gerrit.wikimedia.org/r/496802 (owner: 10Andrew Bogott) [15:23:39] (03PS2) 10GTirloni: ldap: Increase nscd cache size [puppet] - 10https://gerrit.wikimedia.org/r/496800 (https://phabricator.wikimedia.org/T217280) [15:24:18] 10Operations, 10Patch-For-Review: Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10MoritzMuehlenhoff) [15:24:19] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend [15:24:25] (03CR) 10Gehel: [C: 04-1] "See comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:24:35] ACKNOWLEDGEMENT - Labs LDAP on labtestservices2001 is CRITICAL: Could not bind to the LDAP server andrew bogott working on certs https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [15:26:16] (03PS8) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:26:20] ACKNOWLEDGEMENT - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott This is a typo of some sort Valentin will work on it. 
[15:26:20] ACKNOWLEDGEMENT - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend andrew bogott This is a typo of some sort Valentin will work on it. [15:27:53] (03PS1) 10Andrew Bogott: acme_cheif: removing a troublesome config [puppet] - 10https://gerrit.wikimedia.org/r/496804 [15:28:01] (03PS3) 10Filippo Giunchedi: prometheus: maximum connections to proxypass [puppet] - 10https://gerrit.wikimedia.org/r/496750 (https://phabricator.wikimedia.org/T217715) [15:28:46] (03CR) 10Andrew Bogott: [C: 03+2] acme_cheif: removing a troublesome config [puppet] - 10https://gerrit.wikimedia.org/r/496804 (owner: 10Andrew Bogott) [15:30:11] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational [15:30:51] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.016 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [15:31:11] RECOVERY - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.674 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [15:31:15] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:31:21] RECOVERY - Check systemd state on serpens is OK: OK - running: The system is fully operational [15:34:14] (03PS1) 10Herron: rsyslog: update syslog_json template with format jsonfr [puppet] - 10https://gerrit.wikimedia.org/r/496806 (https://phabricator.wikimedia.org/T213899) [15:36:54] (03PS1) 10Andrew Bogott: openldap certs: allow group 'openldap' to read them [puppet] - 10https://gerrit.wikimedia.org/r/496807 (https://phabricator.wikimedia.org/T218398) [15:37:15] PROBLEM - Check systemd state on serpens is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:38:01] !log rebooting labtestnet2002 for kernel update [15:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:09] (03CR) 10Andrew Bogott: [C: 03+2] openldap certs: allow group 'openldap' to read them [puppet] - 10https://gerrit.wikimedia.org/r/496807 (https://phabricator.wikimedia.org/T218398) (owner: 10Andrew Bogott) [15:38:53] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[slapd] [15:39:17] (03CR) 10Jcrespo: [C: 04-1] "These dns entries do not work anymore." [dns] - 10https://gerrit.wikimedia.org/r/496410 (owner: 10Marostegui) [15:39:43] RECOVERY - Check systemd state on serpens is OK: OK - running: The system is fully operational [15:40:13] (03CR) 10Jcrespo: "Please note python linting, testing and development is done on the saner wmfmariadbpy repo." [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [15:42:49] !log rebooting labtestcontrol2003 for kernel update [15:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:54] ACKNOWLEDGEMENT - MD RAID on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T218403 [15:44:00] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10ops-monitoring-bot) [15:44:12] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:45:10] (03CR) 10CRusnov: "> Patch Set 3: Code-Review-1" (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [15:47:51] !log rebooting labtestservices2002 for kernel update [15:47:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:55] !log enabling puppet on seaborgium to apply new acme cert [15:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:12] ACKNOWLEDGEMENT - MD RAID on labtestservices2002 is CRITICAL: connect to address 208.80.153.76 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T218405 [15:50:19] 10Operations, 10ops-codfw: Degraded RAID on labtestservices2002 - https://phabricator.wikimedia.org/T218405 (10ops-monitoring-bot) [15:50:20] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/ldap-labtest.rsa-2048.crt],File[/etc/acmecerts/ldap-labtest.rsa-2048.chain.crt],File[/etc/acmecerts/ldap-labtest.rsa-2048.chained.crt],File[/etc/acmecerts/ldap-labtest.rsa-2048.key] [15:51:42] akosiaris: I see two icinga warnings for "Stale template error files present for '/srv/config-master/pybal/eqiad/citoid'" (and codfw) [15:51:46] akosiaris: known? [15:53:07] (03PS3) 10GTirloni: ldap: Increase nscd cache size [puppet] - 10https://gerrit.wikimedia.org/r/496800 (https://phabricator.wikimedia.org/T217280) [15:53:26] (03CR) 10Jcrespo: [C: 04-1] "The actual method is https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replica_DNS , I think" [dns] - 10https://gerrit.wikimedia.org/r/496410 (owner: 10Marostegui) [15:53:38] ema: hmm yeah it must be garbage from the migration yesterday, /me looking [15:53:58] well .. "migration". it was just switching over the config. 
the traffic migration had already happened [15:54:38] !log rebooting labtestservices2003 for kernel update [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:28] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): Hardware decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10Andrew) right now labtestservices2001 is the only host for the labtest ldap db. So we should move that someplace before we deco... [15:55:47] (03CR) 10GTirloni: [C: 03+2] ldap: Increase nscd cache size [puppet] - 10https://gerrit.wikimedia.org/r/496800 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [15:59:18] !log puppetmaster1001 rm /var/run/confd-template/.citoid*.err to remove old stale confd files that resulted from merging https://gerrit.wikimedia.org/r/494213 [15:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:21] ema ^ [15:59:25] should recover [16:01:36] akosiaris: it has, thanks! [16:02:39] (03PS14) 10CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [16:03:05] (03CR) 10CRusnov: "> Patch Set 13: Code-Review-1" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [16:05:21] (03PS1) 10GTirloni: ldap: Increase nscd max file size [puppet] - 10https://gerrit.wikimedia.org/r/496812 (https://phabricator.wikimedia.org/T217280) [16:06:22] (03PS1) 10Ppchelko: WIP, TEST for PPC, Add rsyslog kafka to services. [puppet] - 10https://gerrit.wikimedia.org/r/496813 [16:06:45] (03CR) 10GTirloni: [C: 03+2] ldap: Increase nscd max file size [puppet] - 10https://gerrit.wikimedia.org/r/496812 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [16:07:34] (03CR) 10jerkins-bot: [V: 04-1] WIP, TEST for PPC, Add rsyslog kafka to services. 
[puppet] - 10https://gerrit.wikimedia.org/r/496813 (owner: 10Ppchelko) [16:09:18] !log upgrading deployment-deploy01 to component/php72 [16:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:05] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ema) RAM cache usage seems to be growing non-stop. I have left `proxy.config.cache.ram_cache.size` to the default value of `-1`, which according... [16:26:36] (03PS2) 10Jbond: Create and mtail parser for ulogd and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) [16:27:31] (03CR) 10jerkins-bot: [V: 04-1] Create and mtail parser for ulogd and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) (owner: 10Jbond) [16:28:27] (03PS9) 10Andrew Bogott: Service name and IPs for ldap-behind-lvs [dns] - 10https://gerrit.wikimedia.org/r/496007 (https://phabricator.wikimedia.org/T218133) [16:29:38] (03PS10) 10Andrew Bogott: Service name and IPs for ldap-behind-lvs [dns] - 10https://gerrit.wikimedia.org/r/496007 (https://phabricator.wikimedia.org/T218133) [16:30:14] (03CR) 10Andrew Bogott: [C: 03+2] Service name and IPs for ldap-behind-lvs [dns] - 10https://gerrit.wikimedia.org/r/496007 (https://phabricator.wikimedia.org/T218133) (owner: 10Andrew Bogott) [16:32:58] (03PS3) 10Jbond: Create and mtail parser for ulogd and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) [16:33:31] (03CR) 10Muehlenhoff: [C: 03+1] admin: create user with analytics-privatedata access for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/496379 (https://phabricator.wikimedia.org/T217438) (owner: 10Vgutierrez) [16:33:52] (03CR) 10jerkins-bot: [V: 04-1] Create and mtail parser for ulogd 
and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) (owner: 10Jbond) [16:33:53] 10Operations, 10netops: eqiad - eqord Telia link down - IC-314533 - https://phabricator.wikimedia.org/T218307 (10Dzahn) Telia claims the issue was resolved. our monitoring can't confirm that. ` Telia Carrier Case Reference 00959377 Service ID IC-314533 Service Type DWDM - 10 Gigabit - Ashburn - Chicago Ca... [16:39:40] (03PS4) 10Jbond: Create and mtail parser for ulogd and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) [16:39:55] (03PS5) 10Dzahn: icinga: make notes_url a required parameter of monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) [16:40:57] (03CR) 10jerkins-bot: [V: 04-1] icinga: make notes_url a required parameter of monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [16:41:29] 10Operations, 10netops: eqiad - eqord Telia link down - IC-314533 - https://phabricator.wikimedia.org/T218307 (10ayounsi) a:03ayounsi We're getting one way light, followed up with Telia. Ashburn side: ` Physical interface: xe-4/2/0 Laser bias current : 43.070 mA Laser output... [16:44:32] (03CR) 10Jbond: "i think the patch is ready for review now. 
It currently has four different metrics, i don't expect all of them to be used but i thought i" [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) (owner: 10Jbond) [16:44:50] (03PS1) 10BryanDavis: keyholder: use os.getgrouplist() for listing a user's groups [puppet] - 10https://gerrit.wikimedia.org/r/496823 (https://phabricator.wikimedia.org/T204681) [16:45:14] (03PS1) 10Dzahn: mediawiki::webserver: Icinga link for etcd config check [puppet] - 10https://gerrit.wikimedia.org/r/496824 [16:46:02] (03CR) 10Faidon Liambotis: [C: 03+1] keyholder: use os.getgrouplist() for listing a user's groups [puppet] - 10https://gerrit.wikimedia.org/r/496823 (https://phabricator.wikimedia.org/T204681) (owner: 10BryanDavis) [16:47:00] (03CR) 10Dzahn: [C: 03+2] mediawiki::webserver: Icinga link for etcd config check [puppet] - 10https://gerrit.wikimedia.org/r/496824 (owner: 10Dzahn) [16:47:10] (03PS2) 10Andrew Bogott: Add an lvs service in front of our two ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) [16:47:48] (03PS6) 10Dzahn: icinga: make notes_url a required parameter of monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) [16:48:04] 10Operations, 10ops-eqiad: No description on asw2-c-eqiad:xe-2/0/5 - https://phabricator.wikimedia.org/T218411 (10ayounsi) p:05Triage→03Low [16:48:05] (03CR) 10jerkins-bot: [V: 04-1] Add an lvs service in front of our two ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) (owner: 10Andrew Bogott) [16:51:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] keyholder: use os.getgrouplist() for listing a user's groups [puppet] - 10https://gerrit.wikimedia.org/r/496823 (https://phabricator.wikimedia.org/T204681) (owner: 10BryanDavis) [16:51:44] (03PS2) 10Arturo Borrero Gonzalez: keyholder: use os.getgrouplist() for listing a user's groups 
[puppet] - 10https://gerrit.wikimedia.org/r/496823 (https://phabricator.wikimedia.org/T204681) (owner: 10BryanDavis) [16:58:10] PROBLEM - Keyholder SSH agent on acmechief1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [16:58:24] 10Operations, 10serviceops: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (10jijiki) [16:58:57] !log arming keyholder in cumin1001 [16:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:08] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [16:59:20] (03PS3) 10Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) [16:59:42] PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [16:59:48] PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:00:14] !log arm keyholder in acmechief1001 [17:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:28] PROBLEM - Keyholder SSH agent on deploy2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:00:36] RECOVERY - Keyholder SSH agent on acmechief1001 is OK: OK: Keyholder is armed with all configured keys. 
[17:00:37] bd808: -_- [17:00:51] !log clean up rigel switch port [17:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:24] !log arm keyholder in deploy101 [17:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:31] (CR) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott) [17:01:34] RECOVERY - Keyholder SSH agent on deploy1001 is OK: OK: Keyholder is armed with all configured keys. [17:02:43] !log arm keyholder in labpuppetmaster1002 [17:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:02] PROBLEM - Host paws.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [17:03:22] RECOVERY - Keyholder SSH agent on labpuppetmaster1002 is OK: OK: Keyholder is armed with all configured keys. [17:03:41] !log arm keyholder in sarin [17:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:20] PROBLEM - Keyholder SSH agent on neodymium is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:04:38] RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys. [17:04:57] !log arm keyholder in deploy2001 [17:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:00] (CR) Dzahn: "The notes_url parameter has now been added to all existing uses of monitoring::service in https://gerrit.wikimedia.org/r/q/topic:%22notes-" [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [17:05:20] RECOVERY - Keyholder SSH agent on deploy2001 is OK: OK: Keyholder is armed with all configured keys. 
[17:06:07] (PS7) Dzahn: icinga: make notes_url a required parameter of monitoring::service [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) [17:06:40] (PS8) Dzahn: icinga: make notes_url a required parameter of monitoring::service [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) [17:07:05] arturo: :( E_TOOMANYKEYHOLDERS [17:07:15] it should be fine now [17:08:07] (CR) Dzahn: [C: +2] icinga: make notes_url a required parameter of monitoring::service [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [17:08:50] PROBLEM - Keyholder SSH agent on labpuppetmaster1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:10:02] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:11:22] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:11:44] (PS1) Ema: ATS: set RAM cache size [puppet] - https://gerrit.wikimedia.org/r/496829 (https://phabricator.wikimedia.org/T213263) [17:12:48] !log netmon1002 - armed keyholder for rancid [17:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:25] !log netmon2001 - armed keyholder for rancid [17:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:38] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. [17:13:38] (PS4) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) [17:13:44] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. [17:15:06] PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. 
Run keyholder arm to arm it. [17:16:54] (PS1) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [17:17:57] (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [17:19:26] PROBLEM - puppet last run on icinga2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:20:06] PROBLEM - Keyholder SSH agent on acmechief2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:20:55] oh man [17:21:16] thanks mutante [17:21:46] !log updating puppet compiler facts [17:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:15] !log cumin2001 - armed keyholder [17:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:18] RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 36.88 ms [17:22:24] RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. [17:23:58] (CR) CDanis: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/496750 (https://phabricator.wikimedia.org/T217715) (owner: Filippo Giunchedi) [17:25:36] !log acmechief2001 - armed keyholder [17:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:46] RECOVERY - Keyholder SSH agent on labpuppetmaster1001 is OK: OK: Keyholder is armed with all configured keys. [17:26:08] RECOVERY - Keyholder SSH agent on acmechief2001 is OK: OK: Keyholder is armed with all configured keys. [17:26:26] PROBLEM - Host paws.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [17:28:38] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:28:52] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:29:53] (PS1) Dzahn: icinga: add notes_url for toollabs checker [puppet] - https://gerrit.wikimedia.org/r/496834 [17:30:41] (CR) Dzahn: [C: +2] icinga: add notes_url for toollabs checker [puppet] - https://gerrit.wikimedia.org/r/496834 (owner: Dzahn) [17:31:22] ACKNOWLEDGEMENT - puppet last run on icinga2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830/1 [17:31:22] ACKNOWLEDGEMENT - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830/1 [17:31:22] ACKNOWLEDGEMENT - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830/1 [17:31:40] (PS2) Dzahn: icinga: add notes_url for toollabs checker [puppet] - https://gerrit.wikimedia.org/r/496834 [17:31:44] Operations, Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (Krenair) Isn't the existing list private though? There is nothing in this request to suggest the new list would be non-public. 
[17:37:02] RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms [17:39:52] (PS1) Dzahn: grafana: add a missing Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/496835 [17:40:06] RECOVERY - puppet last run on icinga2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:41:38] (CR) Dzahn: [C: +2] grafana: add a missing Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/496835 (owner: Dzahn) [17:41:46] (PS2) Dzahn: grafana: add a missing Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/496835 [17:42:28] jdlrobson: your change is live on mwdebug1002, check please [17:43:04] PROBLEM - puppet last run on grafana1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:44:19] (PS1) CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 [17:44:53] Operations, Traffic, Wikidata, Wikidata-Campsite, and 2 others: nuke_limit often reached on esams varnish frontends - https://phabricator.wikimedia.org/T216006 (ema) The issue should be fixed. @VladimirAlexiev, @Addshore: can you confirm? [17:45:08] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:14] (CR) jerkins-bot: [V: -1] Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (owner: CRusnov) [17:46:18] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:46:56] ^ that netmon problem is caused by the recent merge of the requiring notes_url for monitoring checks [17:47:08] chaomodus: yes, i am aware and fixing it [17:47:14] just did the same for labmon [17:47:16] is the requirements for that documented somewhere for future reference? 
[17:47:22] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:47:29] it would be if i had a chance to send the mail :) [17:47:37] but other things keep breaking [17:47:38] aha okay :) [17:48:31] chaomodus: the sad part is that normally jenkins bot detects those before merge but in some cases it does not [17:48:56] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:49:10] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:49:12] Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (herron) [17:51:17] (PS1) Dzahn: netbox: add missing Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/496838 [17:51:55] (CR) CRusnov: [C: +1] netbox: add missing Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/496838 (owner: Dzahn) [17:52:03] (PS2) Dzahn: netbox: add missing Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/496838 [17:52:11] (PS2) CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [17:52:20] (CR) Dzahn: [C: +2] netbox: add missing Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/496838 (owner: Dzahn) [17:52:43] (PS1) Andrew Bogott: openldap: add more fake passwords for the compiler [labs/private] - https://gerrit.wikimedia.org/r/496839 [17:53:13] !log depool cp2002's varnish-fe for the weekend T213263#5027366 [17:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:15] (CR) jerkins-bot: [V: -1] 
Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov) [17:53:16] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [17:53:32] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=nginx [17:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:34] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=varnish-fe [17:53:34] (CR) Andrew Bogott: [V: +2 C: +2] openldap: add more fake passwords for the compiler [labs/private] - https://gerrit.wikimedia.org/r/496839 (owner: Andrew Bogott) [17:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:22] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:55:26] jdlrobson: how does mwdebug1002 look? still testing? [17:56:34] LGTM please sync! 
[17:56:52] * thcipriani does [17:57:18] (PS5) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) [17:57:26] (PS6) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) [17:57:54] (PS2) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [17:59:01] thcipriani: please confirm once it's live so i can release the QA hounds [17:59:10] jdlrobson: will do, syncing now [17:59:13] (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [17:59:31] and it goes without saying thank you for doing a scary friday deploy (there should be a t-shirt for that if there isn't) [17:59:54] (PS1) Andrew Bogott: Add acme-chief key for labtestservices2001 [puppet] - https://gerrit.wikimedia.org/r/496841 [17:59:54] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/MobileFrontend: SWAT: [[gerrit:496827|iOS: Fix mobile editor]] T218069 T218062 T218352 T211490 T218062 T211491 T172877 (duration: 00m 54s) [18:00:00] ^ jdlrobson live now [18:00:02] (CR) Bstorm: [C: +2] dumps distribution: fail dumps back to labstore1006 [dns] - https://gerrit.wikimedia.org/r/496614 (https://phabricator.wikimedia.org/T217473) (owner: Bstorm) [18:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:09] T218062: Dialogs in iOS mobile VE can't be scrolled - https://phabricator.wikimedia.org/T218062 [18:00:09] T218352: iOS Safari: Can't save edit in WikiText editor after entering edit summary - https://phabricator.wikimedia.org/T218352 [18:00:10] T172877: Backspace 
during mobile editing moves viewing window to the top - https://phabricator.wikimedia.org/T172877 [18:00:10] T211490: iPad: scrolling is broken when editing mobile site in source mode - https://phabricator.wikimedia.org/T211490 [18:00:10] T211491: iPad Pro/iOS 12.1: cursor jumps out of view when editing mobile site in source mode - https://phabricator.wikimedia.org/T211491 [18:00:11] T218069: Multiple issues with scrolling in the "Add discussion" overlay on iOS render it unusable (can't save, can't type long messages) - https://phabricator.wikimedia.org/T218069 [18:00:12] (PS2) Bstorm: dumps distribution: fail dumps back to labstore1006 [dns] - https://gerrit.wikimedia.org/r/496614 (https://phabricator.wikimedia.org/T217473) [18:00:43] (PS2) CRusnov: Add report which checks against puppetdb and compares serial numbers [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526) [18:00:51] (CR) Vgutierrez: [C: +1] Add acme-chief key for labtestservices2001 [puppet] - https://gerrit.wikimedia.org/r/496841 (owner: Andrew Bogott) [18:01:20] (CR) Andrew Bogott: [C: +2] Add acme-chief key for labtestservices2001 [puppet] - https://gerrit.wikimedia.org/r/496841 (owner: Andrew Bogott) [18:01:29] (PS2) Andrew Bogott: Add acme-chief key for labtestservices2001 [puppet] - https://gerrit.wikimedia.org/r/496841 [18:01:31] (PS3) Bstorm: Revert "dumps distribution: swap do_acme for dumps server failover" [puppet] - https://gerrit.wikimedia.org/r/496576 [18:02:25] (CR) CRusnov: "See separate patch for the microservice proxy in progress. This is trivially ported to that proxy of course." 
(4 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov) [18:02:39] (PS1) Dzahn: mediawiki::webserver: add missing notes URL [puppet] - https://gerrit.wikimedia.org/r/496842 [18:03:44] (CR) Bstorm: [V: +2] Revert "dumps distribution: swap do_acme for dumps server failover" [puppet] - https://gerrit.wikimedia.org/r/496576 (owner: Bstorm) [18:03:53] (CR) Bstorm: [V: +2 C: +2] Revert "dumps distribution: swap do_acme for dumps server failover" [puppet] - https://gerrit.wikimedia.org/r/496576 (owner: Bstorm) [18:03:55] (PS2) Dzahn: mediawiki::webserver: add missing notes URL [puppet] - https://gerrit.wikimedia.org/r/496842 [18:04:04] (PS4) Bstorm: Revert "dumps distribution: swap do_acme for dumps server failover" [puppet] - https://gerrit.wikimedia.org/r/496576 [18:04:14] (CR) Bstorm: [V: +2 C: +2] Revert "dumps distribution: swap do_acme for dumps server failover" [puppet] - https://gerrit.wikimedia.org/r/496576 (owner: Bstorm) [18:06:06] (CR) Dzahn: [C: +2] mediawiki::webserver: add missing notes URL [puppet] - https://gerrit.wikimedia.org/r/496842 (owner: Dzahn) [18:06:16] (PS3) Dzahn: mediawiki::webserver: add missing notes URL [puppet] - https://gerrit.wikimedia.org/r/496842 [18:06:41] thank you thcipriani it's looking good [18:06:51] waitng for edit spikes in the graphs now ;-) [18:07:20] jdlrobson: glad to hear everything appears to be working :) [18:08:19] RECOVERY - puppet last run on grafana1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:08:25] (PS3) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [18:10:10] (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 
https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [18:10:35] (CR) CRusnov: "-inline" (1 comment) [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov) [18:11:15] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2566 MB (5% inode=57%) [18:11:55] RECOVERY - Labs LDAP on labtestservices2001 is OK: LDAP OK - 0.013 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [18:12:45] (PS7) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) [18:15:10] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:15:12] (CR) Andrew Bogott: [C: +2] Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496065 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott) [18:15:43] (PS3) CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [18:17:37] (CR) jerkins-bot: [V: -1] Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov) [18:24:46] Operations, ops-eqiad: No description on asw2-c-eqiad:xe-2/0/5 - https://phabricator.wikimedia.org/T218411 (Cmjohnson) Open→Resolved Port description corrected xe-2/0/5 up up cloudelastic1003 [18:27:17] (PS4) Eevans: Ad-hoc Cassandra clusters for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/496192 [18:27:23] (PS4) CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [18:27:39] 
(PS1) Jforrester: Duplicate …Squid variables into …Cdn ahead of MW renaming [mediawiki-config] - https://gerrit.wikimedia.org/r/496847 (https://phabricator.wikimedia.org/T104148) [18:27:42] (PS1) Jforrester: Stop reading wmgUseClusterSquid, never varied [mediawiki-config] - https://gerrit.wikimedia.org/r/496848 [18:27:44] (PS1) Jforrester: Stop setting wmgUseClusterSquid, never varied, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/496849 [18:27:46] (PS1) Jforrester: De-dplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) [18:29:47] (CR) Ppchelko: "recheck" [puppet] - https://gerrit.wikimedia.org/r/496813 (owner: Ppchelko) [18:31:27] Operations, ops-eqiad, Analytics, Analytics-Kanban: confirm gpu form factor in stat1005 - https://phabricator.wikimedia.org/T216528 (Cmjohnson) @elukey please see attached jpg . {F28393516} [18:31:56] (PS1) Andrew Bogott: Revert "Add lvs to the read-only ldap replicas" [puppet] - https://gerrit.wikimedia.org/r/496856 [18:32:32] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:32:54] (CR) jerkins-bot: [V: -1] Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov) [18:33:32] (PS1) Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/496857 (https://phabricator.wikimedia.org/T213969) [18:34:12] RECOVERY - Disk space on contint1001 is OK: DISK OK [18:34:16] (CR) jerkins-bot: [V: -1] WIP, TEST for PPC, Add rsyslog kafka to services. 
[puppet] - https://gerrit.wikimedia.org/r/496813 (owner: Ppchelko) [18:34:32] (CR) Andrew Bogott: [C: +2] Revert "Add lvs to the read-only ldap replicas" [puppet] - https://gerrit.wikimedia.org/r/496856 (owner: Andrew Bogott) [18:34:36] (PS5) CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [18:34:43] (CR) Vgutierrez: [C: +1] "pcc shows how prod is broken and works with the change: https://puppet-compiler.wmflabs.org/compiler1002/15162/" [puppet] - https://gerrit.wikimedia.org/r/496856 (owner: Andrew Bogott) [18:37:04] (CR) Jforrester: [C: -2] "Not until we know we're going ahead with this." [mediawiki-config] - https://gerrit.wikimedia.org/r/496847 (https://phabricator.wikimedia.org/T104148) (owner: Jforrester) [18:37:06] PROBLEM - puppet last run on lvs1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:37:46] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:37:54] (PS1) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [18:38:49] (PS2) Ppchelko: Add rsyslog kafka to service nodes. [puppet] - https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) [18:40:10] (CR) jerkins-bot: [V: -1] Add rsyslog kafka to service nodes. 
[puppet] - https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko) [18:42:18] RECOVERY - puppet last run on lvs1016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:42:31] (PS2) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [18:43:45] (CR) Vgutierrez: [C: -1] Add lvs to the read-only ldap replicas (3 comments) [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott) [18:47:19] (PS3) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [18:48:17] (PS6) CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [18:49:21] Operations, ops-eqiad, Analytics, Analytics-Kanban: confirm gpu form factor in stat1005 - https://phabricator.wikimedia.org/T216528 (EBernhardson) Thanks chris! Based on this a standard dual-slot card will fit in this configuration. There will not be room for a second GPU. My suggestion is to mo... [18:50:13] (CR) Vgutierrez: [C: -1] Add lvs to the read-only ldap replicas (2 comments) [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott) [18:52:19] (PS4) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [18:53:07] Operations, Analytics, hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (EBernhardson) https://phabricator.wikimedia.org/T216528#5027798 includes a picture of the rear of the case. This confirms a dual-slot card will fit into one side of the chasis. 
There will b... [18:54:11] (CR) CRusnov: "Compile looks good https://puppet-compiler.wmflabs.org/compiler1002/15166/" [puppet] - https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov) [18:54:29] (PS3) Ppchelko: Add rsyslog kafka to service nodes. [puppet] - https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) [18:58:13] (CR) Vgutierrez: "pcc looks sane/happy now https://puppet-compiler.wmflabs.org/compiler1002/15167/ but I'd merge this on Monday" [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott) [18:58:42] (CR) Mobrovac: [C: +1] Add rsyslog kafka to service nodes. [puppet] - https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko) [19:08:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ldap-ro on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/ldap-ro [19:10:18] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ldap-ro-ssl on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/ldap-ro-ssl [19:12:20] (PS1) Alex Monk: openstack::clientpackages::common: include python3 packages [puppet] - https://gerrit.wikimedia.org/r/496863 [19:12:48] (PS2) Alex Monk: openstack::clientpackages::common: include python3 packages [puppet] - https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) [19:13:07] hm… vgutierrez do you know where the state is that's producing ^^? I assume it's just puppet runs being out of sync somehow [19:13:39] * andrewbogott runs conftool-merge [19:20:31] (CR) Marostegui: "bstorm, bd808, can you guys confirm what's the correct way of doing this?" [dns] - https://gerrit.wikimedia.org/r/496410 (owner: Marostegui) [19:34:29] (CR) BryanDavis: "> Is this done via this change or via the wikitech page Jaime posted?" 
[dns] - https://gerrit.wikimedia.org/r/496410 (owner: Marostegui) [19:36:13] (PS1) Herron: rsyslog: remove format=json from msg field in syslog_json template [puppet] - https://gerrit.wikimedia.org/r/496871 (https://phabricator.wikimedia.org/T213899) [19:38:00] (PS1) Mobrovac: Varnish: serve Swift traffic in active/active mode [puppet] - https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) [19:38:23] (CR) Marostegui: "> > Is this done via this change or via the wikitech page Jaime" [dns] - https://gerrit.wikimedia.org/r/496410 (owner: Marostegui) [19:40:11] (CR) Mobrovac: [C: -1] "Not ready to go, -1'ing until that is the case." [puppet] - https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: Mobrovac) [19:41:20] (PS1) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [19:43:17] (CR) Jforrester: [C: -2] "Hold for performance concerns." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: Ammarpad) [19:44:42] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/15168/icinga2001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [19:46:03] Operations, monitoring, netops, Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992 (ayounsi) a:ayounsi [19:50:32] (PS2) Herron: rsyslog: update syslog_json template with format jsonf [puppet] - https://gerrit.wikimedia.org/r/496806 (https://phabricator.wikimedia.org/T213899) [19:50:55] Operations, CX-cxserver, Citoid, Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (mobrovac) [20:01:10] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (dduvall) >>! In T205911#5026051, @mobrovac wrote: > > Having ser... [20:07:13] (CR) Bstorm: "They are all there in trusty :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) (owner: Alex Monk) [20:08:08] (PS3) Alex Monk: openstack::clientpackages::common: include python3 packages [puppet] - https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) [20:08:28] (CR) CRusnov: "check code looks good!" 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [20:11:23] (PS4) Alex Monk: openstack::clientpackages::common: include python3 packages [puppet] - https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) [20:12:29] (CR) Alex Monk: openstack::clientpackages::common: include python3 packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) (owner: Alex Monk) [20:22:44] Puppet, cloud-services-team, Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (bd808) Resolved→Open Reports of unintended fallout via irc: ` [20:08] < Izhidez> why are we forcing old version... [20:30:36] Puppet, cloud-services-team, Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (stwalkerster) ` name=/var/log/syslog Mar 15 13:43:50 accounts-db3 systemd[1]: Stopping LSB: Start and stop the mysql data... [20:33:08] (PS2) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [20:33:52] (CR) jerkins-bot: [V: -1] Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [20:37:35] (CR) Ayounsi: "Thanks! Addressing the 2 comments." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [20:39:16] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/496806 (https://phabricator.wikimedia.org/T213899) (owner: Herron) [20:46:43] (CR) CRusnov: [C: +1] "LGTM nitpick inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [20:53:25] (03CR) 10Bstorm: openstack::clientpackages::common: include python3 packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [20:54:39] (03CR) 10Bstorm: openstack::clientpackages::common: include python3 packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [20:56:05] (03PS3) 10Ayounsi: Icinga: Add OSPF check to routers [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [20:56:47] (03CR) 10jerkins-bot: [V: 04-1] Icinga: Add OSPF check to routers [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [20:57:13] (03CR) 10Ayounsi: Icinga: Add OSPF check to routers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [20:57:40] (03CR) 10CRusnov: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [21:03:39] (03CR) 10Alex Monk: openstack::clientpackages::common: include python3 packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [21:05:15] (03CR) 10Bstorm: [C: 03+2] "Let's see if we broke it :)" [puppet] - 10https://gerrit.wikimedia.org/r/496863 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [21:07:54] (03CR) 10Herron: "Question -- is the filtering needed for this something that the existing nginx can handle? 
On paper it seems like we could restrict certa" [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [21:10:28] (03CR) 10CRusnov: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [21:12:52] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [21:13:00] PROBLEM - puppet last run on cloudvirt1020 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:13:28] PROBLEM - puppet last run on cloudnet1004 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:13:35] bstorm_: ^^ [21:13:53] Shoot there was a couple there. [21:14:13] Krenair: I was pretty sure there was something, but I couldn't find it ^^ [21:14:42] * anarcat waves [21:14:47] i've opened two issues on cumin [21:16:18] PROBLEM - puppet last run on cloudvirt2002-dev is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:17:04] PROBLEM - puppet last run on cloudnet1003 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 7 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:17:10] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [21:17:10] PROBLEM - puppet last run on cloudvirt1025 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:18:14] (03PS1) 10Bstorm: Revert "openstack::clientpackages::common: include python3 packages" [puppet] - 10https://gerrit.wikimedia.org/r/496900 [21:18:44] PROBLEM - puppet last run on cloudvirt1009 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:19:25] (03CR) 10Alex Monk: [C: 03+1] "yep, didn't think it could be that easy :(" [puppet] - 10https://gerrit.wikimedia.org/r/496900 (owner: 10Bstorm) [21:19:28] PROBLEM - puppet last run on cloudvirtan1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [21:19:46] PROBLEM - puppet last run on cloudvirt2003-dev is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:19:52] (03CR) 10Bstorm: [C: 03+2] Revert "openstack::clientpackages::common: include python3 packages" [puppet] - 10https://gerrit.wikimedia.org/r/496900 (owner: 10Bstorm) [21:20:00] PROBLEM - puppet last run on cloudvirt2001-dev is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 7 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:20:42] PROBLEM - puppet last run on cloudvirt1016 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [21:20:43] (03Abandoned) 10Herron: rsyslog: remove format=json from msg field in syslog_json template [puppet] - 10https://gerrit.wikimedia.org/r/496871 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [21:21:22] PROBLEM - puppet last run on cloudvirt1030 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:22:10] PROBLEM - Disk space on prometheus2003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/services 5446 MB (3% inode=99%) [21:22:47] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) >>! In T218367#5027610, @Krenair wrote: > Isn't the existing list private though? There is nothing in this request to suggest the new list would be non-... [21:22:58] PROBLEM - puppet last run on cloudvirt1021 is CRITICAL: CRITICAL: Puppet has 3 failures. 
Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [21:23:27] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) [21:23:52] PROBLEM - puppet last run on cloudvirt1023 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [21:23:56] PROBLEM - puppet last run on cloudvirt1013 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [21:24:54] 10Operations: Audit our puppet tree for uses of jessie-backports - https://phabricator.wikimedia.org/T216711 (10Bstorm) [21:27:00] PROBLEM - puppet last run on cloudservices1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [21:28:13] (03CR) 10Ayounsi: Expose some PuppetDB values to netmon via microservice (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [21:32:24] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10dduvall) >>! In T205911#5028106, @dduvall wrote: > Or: Isn't `pac... 
[21:33:38] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:35:10] (03CR) 10Cwhite: "comment inline, but otherwise looks ok" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [21:37:50] (03CR) 10Cwhite: "Suggestion inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [21:39:26] RECOVERY - puppet last run on cloudnet1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [21:43:00] RECOVERY - puppet last run on cloudnet1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:43:06] RECOVERY - puppet last run on labtestservices2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:44:12] RECOVERY - puppet last run on cloudvirt1020 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [21:45:30] RECOVERY - puppet last run on cloudvirtan1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:45:48] RECOVERY - puppet last run on cloudvirt2003-dev is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:02] RECOVERY - puppet last run on cloudvirt2001-dev is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:46:48] (03PS1) 10Bstorm: sonofgridengine: Correctly set masters and shadow masters as submithosts [puppet] - 10https://gerrit.wikimedia.org/r/496979 (https://phabricator.wikimedia.org/T216992) [21:47:17] (03PS2) 10Bstorm: sonofgridengine: Correctly set masters and shadow masters as submithosts [puppet] - 10https://gerrit.wikimedia.org/r/496979 (https://phabricator.wikimedia.org/T216992) [21:47:32] RECOVERY - puppet last run on cloudvirt2002-dev is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
[21:48:18] RECOVERY - puppet last run on cloudvirt1025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:48:48] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: Correctly set masters and shadow masters as submithosts [puppet] - 10https://gerrit.wikimedia.org/r/496979 (https://phabricator.wikimedia.org/T216992) (owner: 10Bstorm) [21:49:00] RECOVERY - puppet last run on cloudvirt1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:49:56] RECOVERY - puppet last run on cloudvirt1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:49:58] RECOVERY - puppet last run on cloudvirt1009 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [21:49:58] RECOVERY - puppet last run on cloudvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:51:54] RECOVERY - puppet last run on cloudvirt1016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:51:59] (03PS7) 10CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [21:52:36] RECOVERY - puppet last run on cloudvirt1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:52:41] (03CR) 10jerkins-bot: [V: 04-1] Expose some PuppetDB values to netmon via microservice [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [21:52:58] RECOVERY - puppet last run on cloudservices1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:55:07] (03PS8) 10CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [21:55:38] (03CR) 10CRusnov: "Thank you for the reviews!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [22:01:35] (03PS1) 10Bstorm: sonofgridengine: move the client package out of master.pp [puppet] - 10https://gerrit.wikimedia.org/r/496987 (https://phabricator.wikimedia.org/T216992) [22:02:53] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: move the client package out of master.pp [puppet] - 10https://gerrit.wikimedia.org/r/496987 (https://phabricator.wikimedia.org/T216992) (owner: 10Bstorm) [22:07:03] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) Ping @fgiunchedi about putting this as a common goal next quarter [22:09:13] (03CR) 10Cwhite: [C: 03+1] Expose some PuppetDB values to netmon via microservice [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [22:11:54] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10bd808) Adding #sre-access-requests tag as well because I'm not 100% certain if the NDA needed is just L2 or if the Cobblestone process is required.
[22:22:42] (03PS1) 10BryanDavis: ldap: disable group member list expansion on Stretch clients [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) [22:23:37] (03CR) 10jerkins-bot: [V: 04-1] ldap: disable group member list expansion on Stretch clients [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [22:26:20] (03CR) 10CRusnov: [C: 03+2] Expose some PuppetDB values to netmon via microservice [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [22:26:34] (03PS9) 10CRusnov: Expose some PuppetDB values to netmon via microservice [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) [22:26:36] (03CR) 10BryanDavis: "Jenkins failures seem unrelated?" [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [22:28:12] 10Puppet, 10cloud-services-team, 10Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (10Krenair) @aborrero I tracked down mysql packages in openstack clientpackages to seemingly have been introduced in https:/... [22:32:38] PROBLEM - High lag on wdqs1003 is CRITICAL: 3619 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:34:30] PROBLEM - Check systemd state on puppetdb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:34:40] ^ me [22:34:42] fixing [22:34:46] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:35:06] PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:35:10] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:35:18] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:35:28] okay all of this is me [22:35:59] looks like puppetdb is down in codfw? [22:36:04] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:06] PROBLEM - puppet last run on ping2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:06] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:26] PROBLEM - puppet last run on db2089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:33] yah my +2s fault [22:36:36] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:38] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:46] PROBLEM - puppet last run on mc2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:54] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:22] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:28] PROBLEM - puppet last run on sessionstore2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:37:35] (03PS1) 10Bstorm: sonofgridengine: Read observer creds from file with cli fallback [puppet] - 10https://gerrit.wikimedia.org/r/496993 (https://phabricator.wikimedia.org/T216992) [22:37:36] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:42] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:42] PROBLEM - puppet last run on logstash2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:42] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:44] PROBLEM - puppet last run on mc2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:49] !log temporarily stop ircecho on icinga2001 [22:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:52] PROBLEM - puppet last run on mw2279 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:37:56] okay [22:38:00] one sec will have a patch to fix [22:38:40] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: Read observer creds from file with cli fallback [puppet] - 10https://gerrit.wikimedia.org/r/496993 (https://phabricator.wikimedia.org/T216992) (owner: 10Bstorm) [22:40:09] (03PS1) 10CRusnov: fix copy-paste fail in nginx puppetdb microservice configuration [puppet] - 10https://gerrit.wikimedia.org/r/496997 [22:40:47] (03PS2) 10CRusnov: fix copy-paste fail in nginx puppetdb microservice configuration [puppet] - 10https://gerrit.wikimedia.org/r/496997 [22:42:32] (03CR) 10Ayounsi: [C: 03+1] fix copy-paste fail in nginx puppetdb microservice configuration [puppet] - 10https://gerrit.wikimedia.org/r/496997 (owner: 10CRusnov) [22:42:53] (03CR) 10CRusnov: [C: 03+2] fix copy-paste fail in nginx puppetdb microservice configuration [puppet] - 10https://gerrit.wikimedia.org/r/496997 (owner: 10CRusnov) [22:44:53] (03PS1) 10Bstorm: sonofgridengine: fix mistake in file for grid_configurator [puppet] - 10https://gerrit.wikimedia.org/r/496998 [22:45:25] (03PS2) 10Bstorm: sonofgridengine: fix mistake in file for grid_configurator [puppet] - 10https://gerrit.wikimedia.org/r/496998 [22:46:45] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: fix mistake in file for grid_configurator [puppet] - 10https://gerrit.wikimedia.org/r/496998 (owner: 10Bstorm) [22:54:49] (03PS1) 10Bstorm: sonofgridengine: remove gridengine-client from shadow_master class [puppet] - 10https://gerrit.wikimedia.org/r/497001 (https://phabricator.wikimedia.org/T216992) [22:55:21] crusnov: update? 
[22:55:34] should be good [22:55:49] I can't seem to reach puppetdb from cumin1001 [22:56:06] erm [22:56:09] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: remove gridengine-client from shadow_master class [puppet] - 10https://gerrit.wikimedia.org/r/497001 (https://phabricator.wikimedia.org/T216992) (owner: 10Bstorm) [22:56:31] cumin complains about connection refused [22:57:01] oh [22:57:03] heh [22:57:05] one sec [22:57:45] should be working fingers crossed [22:57:52] apparently puppetdb can explode itself [22:58:44] what was the fix? [22:59:20] removing the broken nginx configuration file and restarting nginx [22:59:26] and it should regenerate the fixed one [22:59:41] ah, great! [22:59:43] nginx is super fragile :\ [23:00:05] I'm running puppet across the infra to clean up the failed runs [23:00:22] okay cool [23:01:47] yap confirmed fix works [23:02:51] nice! [23:03:19] I'm stepping out have to go pickup a rental car before it closes [23:09:09] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:09:09] RECOVERY - puppet last run on db2087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:09:09] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:09:09] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:09:10] RECOVERY - puppet last run on auth1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:09:25] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:09:29] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:09:29] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:09:31] RECOVERY - puppet
last run on db2052 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:09:51] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:09:53] RECOVERY - puppet last run on ms-be1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:09:59] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:09:59] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:01] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:11] RECOVERY - puppet last run on mw1340 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:10:11] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:10:13] RECOVERY - puppet last run on centrallog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:10:13] RECOVERY - puppet last run on ms-be1050 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:10:15] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:19] RECOVERY - puppet last run on elastic2043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:21] RECOVERY - puppet last run on rdb1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:10:23] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [23:10:23] RECOVERY - puppet last run on cloudvirt1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:23] RECOVERY - puppet last run on dumpsdata1001 is OK: 
OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:10:33] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:10:37] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:39] RECOVERY - puppet last run on mc1024 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:10:39] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:43] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:10:45] RECOVERY - puppet last run on mc2028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:10:45] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:10:45] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:10:47] RECOVERY - puppet last run on thumbor1004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [23:14:29] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [23:14:37] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [23:14:41] RECOVERY - puppet last run on logstash2005 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [23:14:48] :D [23:14:55] RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:14:57] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:07] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 4 
minutes ago with 0 failures [23:15:07] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:13] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:13] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:17] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:17] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:21] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:23] RECOVERY - puppet last run on netmon1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:23] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:15:23] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:23] RECOVERY - puppet last run on cp1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:23] RECOVERY - puppet last run on cloudvirt1023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:15:25] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:25] RECOVERY - puppet last run on cp1085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:27] RECOVERY - puppet last run on roentgenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:31] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:33] RECOVERY - puppet last run 
on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:33] RECOVERY - puppet last run on wtp1045 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:15:37] RECOVERY - puppet last run on db1099 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:39] RECOVERY - puppet last run on dns2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:15:39] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [23:15:47] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:47] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:49] RECOVERY - puppet last run on db1100 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [23:15:51] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [23:15:51] RECOVERY - puppet last run on ganeti1005 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [23:15:51] RECOVERY - puppet last run on db1098 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:55] RECOVERY - puppet last run on cp1083 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:15:55] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:55] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:57] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:57] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 33 seconds 
ago with 0 failures [23:15:59] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:16:07] RECOVERY - puppet last run on cloudvirtan1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:16:09] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:16:17] RECOVERY - puppet last run on cp1082 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:16:17] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:16:31] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures [23:16:39] RECOVERY - puppet last run on mw1333 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:16:41] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:16:41] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:16:43] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:16:43] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:17:14] shh icinga [23:17:22] 85% now [23:19:27] actually noisy puppet breakage is my *middle* name [23:22:28] (03PS2) 10BryanDavis: ldap: disable group member list expansion on Stretch clients [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) [23:25:38] (03CR) 10BryanDavis: "> Jenkins failures seem unrelated?" 
[puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [23:26:45] (03PS1) 10Alex Monk: Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - 10https://gerrit.wikimedia.org/r/497009 [23:28:30] (03CR) 10Bstorm: "I can cherry-pick this into toolsbeta and see how it goes just in case :) If it works right, this could make our weekend *better* rather " [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [23:28:44] (03PS3) 10Bstorm: ldap: disable group member list expansion on Stretch clients [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [23:30:05] RECOVERY - puppet last run on icinga2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:32:17] chaomodus: I think we're all clear now :) [23:32:30] :) [23:32:39] RECOVERY - puppet last run on mw2279 is OK: OK: Puppet is currently enabled, last run 27 minutes ago with 0 failures [23:33:19] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:56:06] (03CR) 10Bstorm: "Decided not merging until Monday. It worked in toolsbeta fine, though the value will show up in Trusty as is in addition to Stretch. tool" [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [23:58:34] (03PS1) 10CRusnov: Specify the actual correct parameters to the uwsgi::app [puppet] - 10https://gerrit.wikimedia.org/r/497016 [23:59:36] (03CR) 10jerkins-bot: [V: 04-1] Specify the actual correct parameters to the uwsgi::app [puppet] - 10https://gerrit.wikimedia.org/r/497016 (owner: 10CRusnov)