[00:01:35] (CR) jenkins-bot: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) (owner: BryanDavis)
[00:01:37] (CR) jenkins-bot: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[00:07:24] !log bd808@deploy1001 Synchronized wmf-config/wikitech.php: SWAT: [[gerrit:497866|wikitech: Lock LDAP accounts when users are blocked]], [[gerrit:501123|Disable Phabricator accounts when blocked on wikitech]] (T168692) (duration: 00m 59s)
[00:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:29] T168692: Blocking an account on wikitech should disable LDAP logins - https://phabricator.wikimedia.org/T168692
[00:09:23] !log bd808@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:497866|wikitech: Lock LDAP accounts when users are blocked]], [[gerrit:501123|Disable Phabricator accounts when blocked on wikitech]] (T168692) 2/2 (duration: 00m 57s)
[00:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:17] (PS1) Bstorm: osmdb: set old osmdb servers to spare for decom [puppet] - https://gerrit.wikimedia.org/r/501457 (https://phabricator.wikimedia.org/T220144)
[00:11:03] Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (Bstorm)
[00:11:45] (CR) Bstorm: [C: +2] osmdb: set old osmdb servers to spare for decom [puppet] - https://gerrit.wikimedia.org/r/501457 (https://phabricator.wikimedia.org/T220144) (owner: Bstorm)
[00:18:28] (PS20) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[00:19:12] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:19:38] (CR) jerkins-bot: [V: -1] Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:19:51] (CR) jerkins-bot: [V: -1] Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:21:55] Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (Bstorm)
[00:24:44] (PS21) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[00:26:42] (CR) jerkins-bot: [V: -1] Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:29:23] (PS22) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[00:29:54] Operations, Puppet, puppet-compiler, Release-Engineering-Team (Watching / External): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (ayounsi) That's useful, thank you. It didn't work for https://gerrit.wikimedia.org/r/c/operations/puppet/+/397...
[00:31:18] (PS6) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[00:32:47] (CR) jerkins-bot: [V: -1] Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi)
[00:35:02] (PS7) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[00:35:20] (PS8) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[00:38:35] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi)
[00:40:24] Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (Bstorm)
[00:54:19] PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:49] RECOVERY - puppet last run on wtp1037 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:12:30] (PS3) Alex Monk: profile::cache::ssl::unified: Allow passing certs/certs_active by hiera [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927)
[01:15:21] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:16:37] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:22:25] (PS7) Alex Monk: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - https://gerrit.wikimedia.org/r/498655
[01:27:43] (PS1) Alex Monk: wikiba.se TLS: Make support for different certificate sources clearer [puppet] - https://gerrit.wikimedia.org/r/501461
[01:28:16] (CR) Alex Monk: [C: +1] ssl::wikibase: Fix le_subjects hieradata key name [puppet] - https://gerrit.wikimedia.org/r/501357 (owner: Vgutierrez)
[01:41:53] RECOVERY - Check systemd state on cloudcontrol2001-dev is OK: OK - running: The system is fully operational
[01:45:47] PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:00:05] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:07:45] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:07:58] (PS1) Mathew.onipe: icinga: Ok when total shards is zero [puppet] - https://gerrit.wikimedia.org/r/501462 (https://phabricator.wikimedia.org/T214921)
[02:17:39] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:29] PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: k8s-etcd,prometheus class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:27:05] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[02:27:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:28:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:32:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:32:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:34:09] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[02:44:05] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:13:09] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:21:19] PROBLEM - Apache HTTP on mw1314 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 5.707 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:21:33] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:22:31] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:22:49] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:44:51] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:34:39] PROBLEM - puppet last run on cloudvirtan1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:49:01] !log T216594 Start purge of namespace 0 on ruwiki
[04:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:05] T216594: Layout Stability API origin trial - https://phabricator.wikimedia.org/T216594
[04:58:28] (PS1) Marostegui: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476
[05:01:01] RECOVERY - puppet last run on cloudvirtan1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:07:26] (PS1) Dduvall: Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477
[05:07:28] (PS1) Dduvall: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478
[05:08:23] (PS1) Marostegui: mariadb: Promote db1075 to master [puppet] - https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115)
[05:08:46] (CR) Dduvall: [C: +2] Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477 (owner: Dduvall)
[05:08:55] (CR) Marostegui: [C: -2] "Wait for thursday 10th april" [puppet] - https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:09:22] (CR) Dduvall: [C: +2] Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478 (owner: Dduvall)
[05:09:53] (Merged) jenkins-bot: Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477 (owner: Dduvall)
[05:10:19] (Merged) jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478 (owner: Dduvall)
[05:12:07] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476 (owner: Marostegui)
[05:12:56] (CR) jenkins-bot: Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477 (owner: Dduvall)
[05:12:58] (CR) jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478 (owner: Dduvall)
[05:13:12] (Merged) jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476 (owner: Marostegui)
[05:13:25] (CR) jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476 (owner: Marostegui)
[05:14:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1075 (duration: 00m 59s)
[05:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:19] !log Fully upgrade and reboot db1075
[05:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:47] (PS1) Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115)
[05:20:01] (CR) Marostegui: [C: -2] "Wait for Thursday 10th April" [mediawiki-config] - https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:22:42] (PS1) Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115)
[05:23:06] (CR) Marostegui: [C: -2] "Wait for Thursday 10th April" [mediawiki-config] - https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:27:21] (PS1) Marostegui: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482
[05:29:51] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:30:11] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482 (owner: Marostegui)
[05:31:17] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482 (owner: Marostegui)
[05:32:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1075 with low weight (duration: 00m 58s)
[05:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:13] (PS1) Marostegui: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115)
[05:33:51] (CR) Marostegui: [C: -2] "Wait for Thursday 10th April" [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:35:31] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482 (owner: Marostegui)
[05:39:05] Operations, ops-eqiad, DBA, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) @jcrespo would you mind taking a look at the above patches ^ I have also updated our etherpad with the plan Thanks!
[05:43:43] (PS1) Marostegui: db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485
[05:47:15] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485 (owner: Marostegui)
[05:48:20] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485 (owner: Marostegui)
[05:56:57] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[05:57:55] (CR) jenkins-bot: db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485 (owner: Marostegui)
[05:59:09] Operations, Cassandra, serviceops, Core Platform Team (Session Management Service (CDP2)), and 3 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (Joe)
[06:04:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1075 (duration: 01m 00s)
[06:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:38] Operations, ops-eqiad, DBA, Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui)
[06:21:53] Operations, cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (jijiki)
[06:22:02] Operations, cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (jijiki) p:Triage→Normal
[06:24:10] (CR) Vgutierrez: [C: +2] Revert "profile::cache::ssl::wikibase: Simplify" [puppet] - https://gerrit.wikimedia.org/r/501346 (owner: Alex Monk)
[06:24:23] (PS2) Vgutierrez: Revert "profile::cache::ssl::wikibase: Simplify" [puppet] - https://gerrit.wikimedia.org/r/501346 (owner: Alex Monk)
[06:28:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:33] (CR) ArielGlenn: "We don't currently have any testbed hosts but maybe you want to add it to that role too? hieradata/role/common/dumps/generation/worker/tes" [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff)
[06:36:21] (CR) Vgutierrez: [C: +2] "NOOP in production: https://puppet-compiler.wmflabs.org/compiler1002/15599/" [puppet] - https://gerrit.wikimedia.org/r/501357 (owner: Vgutierrez)
[06:36:30] (PS2) Vgutierrez: ssl::wikibase: Fix le_subjects hieradata key name [puppet] - https://gerrit.wikimedia.org/r/501357
[06:40:39] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487
[06:41:14] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562 (Dzahn)
[06:41:20] Operations, DBA, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Dzahn) Resolved→Open let's only resolve stuff that is actually resolved, not what will be resolved i...
[06:51:48] (CR) Muehlenhoff: "Sure thing, updating the patch." [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff)
[06:52:21] (PS2) Muehlenhoff: snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370
[06:52:35] (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487 (owner: Marostegui)
[06:54:14] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487 (owner: Marostegui)
[06:54:22] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) tensorflow-rocm 1.13.1 available for Python 3.7 on PyPi! https...
[06:55:53] (CR) ArielGlenn: [C: +1] snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff)
[06:56:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1075 (duration: 00m 57s)
[06:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:03:45] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:50] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487 (owner: Marostegui)
[07:08:23] Operations, SRE-Access-Requests: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (Gilles)
[07:08:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:51] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:13:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:20] Operations, SRE-Access-Requests: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (jcrespo) Please advice as the analytics permission masters.
[07:16:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:31] Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (MoritzMuehlenhoff) >>! In T219764#5087063, @Krenair wrote: > Thanks. Do we know how many production hosts are affected, if any? Affected in the sense that they are currently in a broken sta...
[07:16:39] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:15] !log upgrading mw1262-mw1265 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069)
[07:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:19] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069
[07:20:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:31] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:24] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,11 instance=db2044:9100 job=node site=codfw Jcrespo https://phabricator.wikimedia.org/T220102 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops
[07:23:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:21] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:26:27] RECOVERY - MariaDB disk space on dbstore1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:26:53] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[07:27:41] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:27:55] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:27:57] (PS1) Marostegui: db-eqiad.php: Fully repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501493
[07:29:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:51] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:16] !log Repooling thumbor1004 until we replace its memory - T215411
[07:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:20] T215411: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411
[07:37:49] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 866 bytes in 0.086 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:38:05] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26976 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:38:57] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page]
[07:38:57] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:38:59] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[07:40:15] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[07:41:01] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 4 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[07:41:43] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:41:43] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[07:41:53] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:42:01] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[07:42:03] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars]
[07:42:51] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[07:42:51] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[07:43:55] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[07:44:05] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[07:44:15] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:44:15] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:45:47] (PS3) Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[07:47:14] (PS4) Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[07:51:09] !log restart gerrit on cobalt (timeouts and general slowdown)
[07:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:06] (PS1) Jcrespo: mariadb-snapshots: Only create x1 snapshots on dbstore1001 [puppet] - https://gerrit.wikimedia.org/r/501506 (https://phabricator.wikimedia.org/T206203)
[07:55:17] (CR) Marostegui: "> Looks good (with the labsdb1004/1005 caveat)" [puppet] - https://gerrit.wikimedia.org/r/461035 (https://phabricator.wikimedia.org/T100501) (owner: Jcrespo)
[08:00:03] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:00:25] (PS2) Jcrespo: mariadb-snapshots: Only create x1 snapshots on dbstore1001 [puppet] - https://gerrit.wikimedia.org/r/501506 (https://phabricator.wikimedia.org/T206203)
[08:01:49] (CR) Jcrespo: [C: +2] "FYI" [puppet] - https://gerrit.wikimedia.org/r/501506 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[08:02:49] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:02:55] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:03:05] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:03:47] (CR) Marostegui: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/501493 (owner: Marostegui)
[08:03:55] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:03:55] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:04:46] (PS2) Gehel: icinga: Ok when total shards is zero [puppet] - https://gerrit.wikimedia.org/r/501462 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[08:04:56] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501493 (owner: Marostegui)
[08:05:01] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:05:11] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:05:23] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:05:23] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:05:41] (CR) Gehel: [C: +2] icinga: Ok when total shards is zero [puppet] - https://gerrit.wikimedia.org/r/501462 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[08:05:55] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501493 (owner: Marostegui)
[08:06:37] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:06:47] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[08:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:05] Operations, Cassandra, serviceops, Core Platform Team (Session Management Service (CDP2)), and 3 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (Dzahn) a:Dzahn
[08:07:21] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:07:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1075 (duration: 00m 59s)
[08:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:05] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:08:21] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:11:17] (CR) jenkins-bot: db-eqiad.php: Fully repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501493 (owner: Marostegui)
[08:16:04] (PS2) Elukey: Update AQS druid datasource to 2019-03 snapshot [puppet] - https://gerrit.wikimedia.org/r/501341 (owner: Joal)
[08:17:02] (CR) Filippo Giunchedi: [C: +1] grafana: update swift dashboard to use new metric names [puppet] - https://gerrit.wikimedia.org/r/501399 (https://phabricator.wikimedia.org/T219825) (owner: Cwhite)
[08:17:06] (CR) Elukey: [C: +2] Update AQS druid datasource to 2019-03 snapshot [puppet] - https://gerrit.wikimedia.org/r/501341 (owner: Joal)
[08:22:07] PROBLEM - ElasticSearch health check for shards on 9643 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 592 threshold =0.15 breach: number_of_nodes: 15, number_of_in_flight_fetch: 0, status: yellow, timed_out: False, number_of_pending_tasks: 0, active_shards: 2669, active_shards_percent_as_number: 81.84605949095369, unassigned_shards: 592, relocating_shards: 0, initializing_shards: 0, task_max_wait
[08:22:07] is: 0, active_primary_shards: 1087, delayed_unassigned_shards: 0, cluster_name: production-search-psi-eqiad, number_of_data_nodes: 15 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:22:46] ^ restart in progress, should recover in a few seconds
[08:23:22] 15 nodes
confused me a bit before I saw psi... :) [08:23:22] super [08:23:27] no actual issue, the threshold of this check is a bit high for our newer smaller clusters [08:24:43] RECOVERY - ElasticSearch health check for shards on 9643 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, initializing_shards: 4, status: yellow, number_of_in_flight_fetch: 0, number_of_nodes: 17, relocating_shards: 0, task_max_waiting_in_queue_millis: 40696, unassigned_shards: 19, number_of_pending_tasks: 5, active_primary_shards: 1087, number_of_d [08:24:43] med_out: False, delayed_unassigned_shards: 0, active_shards: 3238, active_shards_percent_as_number: 99.29469487887151 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:24:44] (03PS4) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [08:25:14] (03PS3) 10Muehlenhoff: toolforge: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362) [08:25:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [08:26:43] (03PS1) 10Gehel: elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510 [08:27:28] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10fgiunchedi) [08:28:26] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510 (owner: 10Gehel) [08:28:34] (03CR) 10Muehlenhoff: [C: 03+2] toolforge: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362) (owner: 
10Muehlenhoff) [08:29:01] (03CR) 10Gehel: [C: 03+2] elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510 (owner: 10Gehel) [08:29:09] (03PS2) 10Gehel: elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510 [08:29:41] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 3 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Dzahn) @Eevans Done. I added: $application_username = 'sessions' $application_password =... [08:30:39] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [08:30:58] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) p:05Triage→03High a:03aborrero [08:31:26] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [08:31:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.515e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [08:31:47] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 634 threshold =0.15 breach: cluster_name: production-search-omega-eqiad, active_shards_percent_as_number: 80.90361445783132, initializing_shards: 0, 
number_of_data_nodes: 15, delayed_unassigned_shards: 0, active_shards: 2686, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, timed_out: F [08:31:47] ary_shards: 1107, number_of_nodes: 15, status: yellow, number_of_pending_tasks: 0, unassigned_shards: 634, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:32:04] Mirror Maker is surely due to the restarts --^ [08:32:19] !log roll restart of aqs on aqs100* to pick up new druid settings [08:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:41] (03PS2) 10Arturo Borrero Gonzalez: labtestservices2002: rename to cloudservices2002-dev and put into service [puppet] - 10https://gerrit.wikimedia.org/r/501314 (https://phabricator.wikimedia.org/T220101) [08:33:35] (03CR) 10Dzahn: "there is no more redirects.conf that gets created from redirects.dat and i need to upload, right?" [puppet] - 10https://gerrit.wikimedia.org/r/501202 (https://phabricator.wikimedia.org/T219856) (owner: 10Dzahn) [08:34:04] (03Abandoned) 10Muehlenhoff: Remove tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/500409 (owner: 10Muehlenhoff) [08:34:21] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: number_of_data_nodes: 18, status: yellow, number_of_pending_tasks: 11, active_shards: 2917, task_max_waiting_in_queue_millis: 4653, cluster_name: production-search-omega-eqiad, active_primary_shards: 1107, timed_out: False, initializing_shards: 6, unassigned_shards: 397, number_of_node [08:34:21] in_flight_fetch: 0, relocating_shards: 0, active_shards_percent_as_number: 87.86144578313252, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:35:51] (03PS3) 10Arturo Borrero Gonzalez: labtestservices2002: rename to cloudservices2002-dev and put into service [puppet] - 
10https://gerrit.wikimedia.org/r/501314 (https://phabricator.wikimedia.org/T220101) [08:36:16] (03PS2) 10Dzahn: druid: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/501337 [08:36:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:36:30] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) [08:36:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:37:29] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:37:33] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:37:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I think this is good to merge now. No more trusty VMs are present in Cloud VPS." 
[puppet] - 10https://gerrit.wikimedia.org/r/499933 (owner: 10Muehlenhoff) [08:37:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestservices2002: rename to cloudservices2002-dev and put into service [puppet] - 10https://gerrit.wikimedia.org/r/501314 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez) [08:38:11] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:38:19] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:38:21] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:38:21] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:38:23] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:38:33] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out 
before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:38:45] checking [08:39:05] (03PS3) 10Dzahn: druid: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/501337 [08:39:11] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) [08:39:19] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:39:31] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:39:31] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:39:33] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:39:37] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:39:43] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:40:05] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:40:46] so this was us switching a druid datasource, that was not cached [08:40:56] causing timeouts until the cache was warmed up [08:41:19] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80695 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:42:03] elukey: ok to add https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid to all the druid checks? [08:42:21] yep! 
[08:42:26] (it was mere coincidence that i did that at the same time) [08:42:29] ok, doing :) [08:43:18] !log upgrade kubernetes staging cluster to 1.11.9 [08:43:20] (03PS2) 10Muehlenhoff: role::labs::instance: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/499933 [08:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:56] (03CR) 10Dzahn: [C: 03+2] druid: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/501337 (owner: 10Dzahn) [08:44:08] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) p:05Triage→03High [08:45:57] (03PS3) 10Muehlenhoff: role::labs::instance: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/499933 [08:48:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:49:13] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:49:55] (03CR) 10Muehlenhoff: [C: 03+2] role::labs::instance: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/499933 (owner: 10Muehlenhoff) [08:49:57] (03PS1) 10Arturo Borrero Gonzalez: labtestservices2002: rename to cloudservices2002-dev [dns] - 10https://gerrit.wikimedia.org/r/501511 (https://phabricator.wikimedia.org/T220101) [08:51:03] (03CR) 10Ema: "I guess we should 's/swift-rw/swift-ro/' (sic) in ./hieradata/role/common/trafficserver/backend.yaml to make the ATS backends serve swift " [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [08:52:44] (03CR)
10Arturo Borrero Gonzalez: [C: 03+2] labtestservices2002: rename to cloudservices2002-dev [dns] - 10https://gerrit.wikimedia.org/r/501511 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez) [08:54:24] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) [08:54:53] (03CR) 10DCausse: [C: 03+1] "recheck" [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [08:55:50] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `... [08:55:57] !log T220101 reimaging+renaming labtestservices2002 to cloudservices2002-dev [08:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:01] T220101: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 [08:57:15] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=zotero [08:57:16] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=blubberoid [08:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:17] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=mathoid [08:57:18] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=cxserver [08:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=citoid [08:57:20] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:40] !log depool codfw kubernetes apps from discovery in preparation for upgrade [08:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:01] (03PS1) 10Alexandros Kosiaris: nrpe: Add more rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/501512 [08:59:44] (03CR) 10DCausse: [C: 03+1] "recheck" [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [09:00:10] (03CR) 10jerkins-bot: [V: 04-1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [09:00:55] PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] nrpe: Add more rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/501512 (owner: 10Alexandros Kosiaris) [09:02:39] (03PS8) 10Alexandros Kosiaris: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [09:03:37] (03CR) 10jerkins-bot: [V: 04-1] Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [09:04:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Sure. 
Let's say Tuesday 09 Apr" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [09:07:04] (03CR) 10Alexandros Kosiaris: "/me fixing the tests and then merging" [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [09:09:57] (03CR) 10DCausse: [C: 03+1] "recheck" [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [09:10:22] (03CR) 10jerkins-bot: [V: 04-1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [09:10:29] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [09:11:03] PROBLEM - DPKG on acrux is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:14:15] PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100% [09:14:18] ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 6.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [09:14:53] 10Operations, 10cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (10aborrero) [09:15:25] ACKNOWLEDGEMENT - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn WIP by arturo [09:15:53] ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn WIP by arturo https://wikitech.wikimedia.org/wiki/DNS [09:16:01] (03PS9) 10Alexandros Kosiaris: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [09:16:21] 10Operations, 10cloud-services-team: cloudcontrol2001-dev hourly cronspam - 
https://phabricator.wikimedia.org/T220173 (10aborrero) This server is part of an openstack deployment which is being bootstrapped in codfw so we can rescue/save databases as part of Trusty HW deprecation process. See parent tasks {T219... [09:16:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [09:21:38] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh FQDN of DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/501517 (https://phabricator.wikimedia.org/T220101) [09:23:31] RECOVERY - DPKG on acrux is OK: All packages OK [09:23:39] @seen andre_ [09:23:39] mutante: Last time I saw andre_ they were changing the nickname to Guest17533, but Guest17533 is no longer in channel #wikimedia-dev at 10/18/2018 4:53:24 AM (169d4h30m15s ago) [09:24:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> I guess we should 's/swift-rw/swift-ro/' (sic) in ./hieradata/role/common/trafficserver/backend.yaml to make the ATS backends serve swif" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [09:24:55] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh FQDN of DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/501517 (https://phabricator.wikimedia.org/T220101) [09:25:44] (03CR) 10Ema: "> Unless I am mistaken, it would be better to just set active_active:" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [09:26:16] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [09:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:20] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [09:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:15] (03PS1) 
10Filippo Giunchedi: grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) [09:27:21] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: refresh DNS related FQDNs [dns] - 10https://gerrit.wikimedia.org/r/501520 (https://phabricator.wikimedia.org/T220101) [09:27:33] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [09:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:55] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [09:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:24] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: 10Filippo Giunchedi) [09:28:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: refresh DNS related FQDNs [dns] - 10https://gerrit.wikimedia.org/r/501520 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez) [09:28:53] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:29:18] (03PS3) 10Alexandros Kosiaris: Varnish: serve Swift traffic in active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [09:29:27] (03PS2) 10Filippo Giunchedi: grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) [09:29:38] ^ scandium - keeps happening ... is the test host. 
subbu knows [09:30:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: refresh FQDN of DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/501517 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez) [09:30:28] ACKNOWLEDGEMENT - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T219933 [09:31:18] (03PS1) 10Dzahn: add fake passwords for cassandra session store [labs/private] - 10https://gerrit.wikimedia.org/r/501521 (https://phabricator.wikimedia.org/T219560) [09:32:23] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudservices2002-dev.wikimedia.org'] ` and were... [09:33:12] 10Operations, 10Analytics, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10fgiunchedi) @ayounsi this is good to go, unless there's signoff needed? 
[09:36:24] (03PS2) 10Dzahn: add fake passwords for cassandra session store [labs/private] - 10https://gerrit.wikimedia.org/r/501521 (https://phabricator.wikimedia.org/T219560) [09:38:19] (03PS1) 10Arturo Borrero Gonzalez: cloudservices2002-dev: typo in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/501524 (https://phabricator.wikimedia.org/T220101) [09:39:24] (03PS5) 10Gehel: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [09:39:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2002-dev: typo in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/501524 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez) [09:40:01] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake passwords for cassandra session store [labs/private] - 10https://gerrit.wikimedia.org/r/501521 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [09:40:40] (03CR) 10Gehel: [C: 03+2] icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [09:40:50] (03PS6) 10Gehel: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [09:47:50] 10Operations, 10Puppet, 10puppet-compiler, 10Release-Engineering-Team (Watching / External): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) > In addition, Jenkins doesn't seem to like having more than Change-id and Bug in the footer: Seems to... 
[09:47:58] (03PS1) 10Filippo Giunchedi: Add restbase2019 / restbase2020 instances [dns] - 10https://gerrit.wikimedia.org/r/501525 (https://phabricator.wikimedia.org/T217368) [09:48:49] (03CR) 10Dzahn: [C: 03+1] admin: add gpu-users group and assign it to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [09:49:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501370 (owner: 10Muehlenhoff) [09:50:11] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:52:56] (03PS1) 10Filippo Giunchedi: restbase: add restbase2019 / restbase2020 [puppet] - 10https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368) [09:53:37] (03CR) 10Vgutierrez: [C: 04-1] profile::cache::ssl::unified: Allow passing certs/certs_active by hiera (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [09:53:37] (03PS1) 10Dzahn: exim: remove wikivoyage.de from wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/501527 (https://phabricator.wikimedia.org/T219867) [09:53:44] oooh.. 
the puppet error on icinga1001 is real [09:53:46] for once [09:54:58] (03PS3) 10DCausse: Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [09:55:00] (03PS1) 10DCausse: Add workaround to surefire [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528 [09:55:30] (03CR) 10jerkins-bot: [V: 04-1] Add workaround to surefire [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528 (owner: 10DCausse) [09:55:41] (03Abandoned) 10Jbond: pdebuild: add a new repo for build dependencies [puppet] - 10https://gerrit.wikimedia.org/r/500464 (owner: 10Jbond) [09:55:58] (03CR) 10jerkins-bot: [V: 04-1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [09:56:05] gehel: it broke puppet on icinga due to double quotes [09:56:13] mutante: yep, I'm on it [09:56:22] 'k, cool [09:56:53] (03PS2) 10DCausse: Add workaround to surefire [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528 [09:56:55] (03PS4) 10DCausse: Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [09:57:01] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [09:57:47] (03PS1) 10Ema: ATS: install libhwloc5 from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967) [09:57:50] (03PS3) 10Jbond: jessie-backports: remove redundant pins [puppet] - 10https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333) [09:58:46] (03PS1) 10Gehel: elasticsearch: fix single quotes in graphite check [puppet] - 10https://gerrit.wikimedia.org/r/501530 [09:59:17] mutante: ^ have a minute for a review? 
[09:59:27] (03PS2) 10Ema: ATS: install libhwloc5 from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967) [09:59:36] (03PS1) 10Dzahn: toollabs::bastion: remove trusty cgred code [puppet] - 10https://gerrit.wikimedia.org/r/501531 [10:01:44] (03PS3) 10DCausse: Use latest parent pom [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528 [10:01:46] (03PS5) 10DCausse: Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox) [10:01:48] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) [10:01:54] (03CR) 10Dzahn: [C: 03+1] "yep, this seems right. matches with what puppet says on icinga1001 " consider using double quotes"" [puppet] - 10https://gerrit.wikimedia.org/r/501530 (owner: 10Gehel) [10:02:03] sure gehel, looks right [10:02:14] mutante: thanks! [10:02:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [10:02:30] (03CR) 10Gehel: [C: 03+2] elasticsearch: fix single quotes in graphite check [puppet] - 10https://gerrit.wikimedia.org/r/501530 (owner: 10Gehel) [10:04:11] (03PS3) 10Ema: ATS: install libhwloc5 from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967) [10:04:33] ACKNOWLEDGEMENT - toolschecker: expect a long running job on stretch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string ok not found on http://checker.tools.wmflabs.org:80/grid/continuous/stretch - 158 bytes in 0.219 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? 
https://wikitech.wikimedia.org/wiki/portal:toolforge/admin/toolschecker [10:04:34] ACKNOWLEDGEMENT - toolschecker: gridengine webservice running on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 351 bytes in 0.344 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [10:04:35] ACKNOWLEDGEMENT - toolschecker: kubernetes webservice running on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 350 bytes in 0.651 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [10:04:35] ACKNOWLEDGEMENT - toolschecker: start a job and verify on stretch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string ok not found on http://checker.tools.wmflabs.org:80/grid/start/stretch - 158 bytes in 0.793 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? https://wikitech.wikimedia.org/wiki/portal:toolforge/admin/toolschecker [10:05:20] (CR) Ema: [C: +2] ATS: install libhwloc5 from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967) (owner: Ema) [10:05:24] (CR) DCausse: [C: +2] Use latest parent pom [software/gerrit/plugins/barricade] - https://gerrit.wikimedia.org/r/501528 (owner: DCausse) [10:05:30] (CR) DCausse: [C: +2] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - https://gerrit.wikimedia.org/r/501362 (owner: Paladox) [10:05:43] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
daniel_zahn https://phabricator.wikimedia.org/T220173 ? [10:05:43] ACKNOWLEDGEMENT - keystone admin endpoint port 35357 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 35357: Connection refused daniel_zahn https://phabricator.wikimedia.org/T220173 ? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:05:43] ACKNOWLEDGEMENT - keystone public endoint port 5000 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 5000: Connection refused daniel_zahn https://phabricator.wikimedia.org/T220173 ? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:06:10] (CR) jerkins-bot: [V: -1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - https://gerrit.wikimedia.org/r/501362 (owner: Paladox) [10:07:35] (CR) Jbond: [C: +2] jessie-backports: remove redundant pins [puppet] - https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333) (owner: Jbond) [10:07:45] (PS4) Jbond: jessie-backports: remove redundant pins [puppet] - https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333) [10:07:48] (PS1) Arturo Borrero Gonzalez: openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101) [10:07:54] (CR) Jbond: [V: +2 C: +2] jessie-backports: remove redundant pins [puppet] - https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333) (owner: Jbond) [10:07:56] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (Dzahn) labtestnet2003 is still in Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=lab...
[10:08:21] ACKNOWLEDGEMENT - NTP on labtestnet2003 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn https://phabricator.wikimedia.org/T219776 https://wikitech.wikimedia.org/wiki/NTP [10:08:54] (CR) Jbond: [C: +2] pbuilder: add security updates repository [puppet] - https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) (owner: Jbond) [10:09:03] (PS5) Jbond: pbuilder: add security updates repository [puppet] - https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) [10:09:24] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101) (owner: Arturo Borrero Gonzalez) [10:09:36] (CR) DCausse: "recheck" [software/gerrit/plugins/barricade] - https://gerrit.wikimedia.org/r/501362 (owner: Paladox) [10:09:42] (PS2) Arturo Borrero Gonzalez: openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101) [10:10:03] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [10:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:07] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [10:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:16] (CR) DCausse: [C: +2] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - https://gerrit.wikimedia.org/r/501362 (owner: Paladox) [10:10:44] (PS3) Arturo Borrero Gonzalez: openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101) [10:11:00] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently
enabled, last run 1 minute ago with 0 failures [10:12:21] ACKNOWLEDGEMENT - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: k8s-etcd,prometheus class instances not spread out enough daniel_zahn https://phabricator.wikimedia.org/T220189 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:13:46] (CR) Arturo Borrero Gonzalez: "This patch is OK, but I believe you can drop the toollabs::bastion class entirely." [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn) [10:14:51] ACKNOWLEDGEMENT - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] Gehel initial deployment of the check https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:14:51] ACKNOWLEDGEMENT - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] Gehel initial deployment of the check https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:15:19] Operations, SRE-Access-Requests: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (elukey) Checked a bit into puppet and the analytics_deploy key is usable by analytics-admins: ` analytics_deploy: trusted_groups: - analytics-admins ` Thi...
[10:15:35] Operations, SRE-Access-Requests, User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (elukey) [10:15:46] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.103e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:16:31] (CR) Dzahn: "don't forget private repo has modules/privateexim/manifests/init.pp: "/etc/exim4/aliases/wikivoyage.de":" [puppet] - https://gerrit.wikimedia.org/r/501527 (https://phabricator.wikimedia.org/T219867) (owner: Dzahn) [10:17:15] (CR) Elukey: "I am wondering now if the the gpu-users group could be directly a replacement of 'gpu-testers', since probably Erik doesn't need anymore t" [puppet] - https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: Elukey) [10:18:10] (PS1) Jbond: package-builder: export hook variable [puppet] - https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) [10:18:12] PROBLEM - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS [10:22:19] (PS2) Jbond: package-builder: export hook variable [puppet] - https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) [10:25:30] (PS1) Arturo Borrero Gonzalez: openstack: pdns: give mariadb stretch support [puppet] - https://gerrit.wikimedia.org/r/501536 (https://phabricator.wikimedia.org/T220101) [10:26:54] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:26:58] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:27:20] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:27:28] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:28:10] elukey: ^ [10:28:52] (PS1) Arturo Borrero Gonzalez: cloudservices2002-dev: cleanup duplicate entries [dns] - https://gerrit.wikimedia.org/r/501537 (https://phabricator.wikimedia.org/T220101) [10:29:26] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:26] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:40] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL -
druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:30:02] (CR) Arturo Borrero Gonzalez: [C: +2] cloudservices2002-dev: cleanup duplicate entries [dns] - https://gerrit.wikimedia.org/r/501537 (https://phabricator.wikimedia.org/T220101) (owner: Arturo Borrero Gonzalez) [10:30:20] (PS5) BBlack: Shortener VCL fixups [puppet] - https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: Ladsgroup) [10:30:26] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [10:30:36] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:31:02] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [10:31:08] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (aborrero) [10:31:20] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:32:12] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC noop for eqiad1: https://puppet-compiler.wmflabs.org/compiler1002/15603/" [puppet] - https://gerrit.wikimedia.org/r/501536 (https://phabricator.wikimedia.org/T220101) (owner: Arturo Borrero Gonzalez) [10:34:55] Hi ops team - letting you know I'm waiting for elukey to
help with the AQS alarms above - We're on it :) [10:35:05] thanks joal :) [10:35:21] Operations, MediaWiki-extensions-UrlShortener, Traffic, User-Ladsgroup: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (Ladsgroup) [10:40:40] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:40:46] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:40:48] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:40:54] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:41:19] !log restart druid broker on druid1004 - exceptions in the logs after old datasource removal [10:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:56] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (aborrero) [10:42:34] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:42:36] !log restart druid broker on druid100[5,6] - exceptions in the logs after old datasource removal [10:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:24] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:44:00] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:44:06] RECOVERY - PyBal backends health
check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:45:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:45:04] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:46:16] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:46:52] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:47:56] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (aborrero) a:aborrero→Papaul @Papaul I'm assigning this task to you to do the changes related to the... [10:53:26] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: cloudservices2002-dev: include openldap profile [puppet] - https://gerrit.wikimedia.org/r/501539 (https://phabricator.wikimedia.org/T218575) [10:54:31] Operations, SRE-Access-Requests, User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (jcrespo) analytics-deployers seems to me like a good idea, because later maybe someone else wants to do the same and we don't want those other people h...
[10:54:34] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: codfw1dev: cloudservices2002-dev: include openldap profile [puppet] - https://gerrit.wikimedia.org/r/501539 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez) [10:57:48] !log updating puppet catalog compiler facts [10:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:52] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:03:21] (PS1) Mathew.onipe: icinga: fix wrong thresholds [puppet] - https://gerrit.wikimedia.org/r/501542 [11:05:04] Operations, Puppet, Packaging, Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond) A bit more to the picture, managed to get facter to build by updating all refrence of `std::unordered_map` to `std::map`. however im now getting the fol... [11:10:04] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: cloudservices2002-dev introduce hiera keys for openldap [puppet] - https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575) [11:13:41] (PS3) Effie Mouzeli: lvs: Use the kubernetes cluster for cxserver [puppet] - https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: Alexandros Kosiaris) [11:14:46] (PS1) Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [11:15:50] (CR) jerkins-bot: [V: -1] mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo) [11:16:32] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:16:45] (CR) Effie Mouzeli: [C: +2] "Mystery solved, it was due to service restarts on scb*, we are free to proceed" [puppet] - https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: Alexandros Kosiaris) [11:21:31] (PS2) Arturo Borrero Gonzalez: openstack: codfw1dev: cloudservices2002-dev introduce hiera keys for openldap [puppet] - https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575) [11:24:29] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: codfw1dev: cloudservices2002-dev introduce hiera keys for openldap [puppet] - https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez) [11:24:40] (CR) Arturo Borrero Gonzalez: [C: +2] "pcc https://puppet-compiler.wmflabs.org/compiler1002/15607/" [puppet] - https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez) [11:25:37] (PS5) Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [11:25:39] (PS2) Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [11:26:42] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [11:27:02] (CR) jerkins-bot: [V: -1] mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo) [11:27:05] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001
is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:31:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:31:35] (Abandoned) Dzahn: exim: remove wikivoyage.de from wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/501527 (https://phabricator.wikimedia.org/T219867) (owner: Dzahn) [11:31:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:31:53] Operations: DNS for wikivoyage-old.org - https://phabricator.wikimedia.org/T81727 (Dzahn) [11:31:59] Operations, Domains, Traffic, serviceops, Patch-For-Review: contact Wikivoyage e. V.
and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (Dzahn) Open→Resolved [11:32:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:32:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:32:50] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:33:32] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:33:58] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [11:34:18] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:35:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:37:03] !log
Restarting pybal on lvs1006 and lvs2006 for 496382 [11:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:37] ACKNOWLEDGEMENT - MD RAID on cp3041 is CRITICAL: connect to address 10.20.0.176 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220193 [11:37:40] ACKNOWLEDGEMENT - MD RAID on cp3034 is CRITICAL: connect to address 10.20.0.169 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220194 [11:37:42] Operations, ops-esams: Degraded RAID on cp3041 - https://phabricator.wikimedia.org/T220193 (ops-monitoring-bot) [11:37:45] Operations, ops-esams: Degraded RAID on cp3034 - https://phabricator.wikimedia.org/T220194 (ops-monitoring-bot) [11:38:23] (CR) Dzahn: "thanks! i would prefer to do incrementally" [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn) [11:39:54] (PS3) Muehlenhoff: snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370 [11:40:51] (CR) Dzahn: "https://tools.wmflabs.org/openstack-browser/puppetclass/role::toollabs::bastion" [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn) [11:40:58] is someone looking at the issues with esams?
[11:41:10] (CR) Dzahn: [C: +2] toollabs::bastion: remove trusty cgred code [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn) [11:42:06] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:42:58] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:43:45] (CR) Muehlenhoff: [C: +2] snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff) [11:44:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:44:40] Is there a list of what sites are on which shard? I cant remember. [11:44:48] jouncebot: now [11:44:49] No deployments scheduled for the next 70 hour(s) and 45 minute(s) [11:44:57] mark: it looks like the Level3 link between eqiad and esams went down [11:45:12] 10Gbps wave [11:45:28] yes [11:46:21] (CR) Arturo Borrero Gonzalez: [C: +1] toollabs::bastion: remove trusty cgred code [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn) [11:48:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:49:08] (PS1) BBlack: Depool esams, primary transport is down and seeing packet loss [dns] - https://gerrit.wikimedia.org/r/501549 [11:49:29] Zppix: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist and other files, I guess [11:50:02] Lucas_WMDE: ah that was it and actually that was the exact shard i was wanting to look at too :P [11:50:13]
RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:50:13] (CR) BBlack: [C: +2] Depool esams, primary transport is down and seeing packet loss [dns] - https://gerrit.wikimedia.org/r/501549 (owner: BBlack) [11:50:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:50:25] (PS2) BBlack: Depool esams, primary transport is down and seeing packet loss [dns] - https://gerrit.wikimedia.org/r/501549 [11:53:02] !log esams depooled in DNS [11:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:26] (CR) Alexandros Kosiaris: [C: +1] postgresql: set recovery.conf to writeable by postgres user [puppet] - https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652) (owner: Bstorm) [11:56:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:57:52] (PS1) Arturo Borrero Gonzalez: acme_chief: generate certs for cloudservices2002-dev.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) [11:59:23] (CR) Vgutierrez: [C: -1] "looks good in general, fix the nitpicks :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez) [11:59:43] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 59.36 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:01:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:01:53] (PS2) Arturo Borrero Gonzalez: acme_chief: generate certs for cloudservices2002-dev.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) [12:02:41] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [12:02:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:03:25] (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez) [12:03:55] (CR)
Arturo Borrero Gonzalez: [C: +2] acme_chief: generate certs for cloudservices2002-dev.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez) [12:04:51] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [12:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:03] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:08:25] ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn WIP by Arturo https://wikitech.wikimedia.org/wiki/DNS [12:08:55] (PS1) BBlack: Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - https://gerrit.wikimedia.org/r/501552 [12:09:53] ACKNOWLEDGEMENT - Labs LDAP on cloudservices2002-dev is CRITICAL: Could not search/find objectclasses in dc=wikimedia,dc=org daniel_zahn host in downtime, just services on it were not https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [12:09:54] ACKNOWLEDGEMENT - mysqld processes on cloudservices2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld daniel_zahn host in downtime, just services on it were not https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:10:47] arturo: does cloudservices2002-dev mysql really need to page? [12:11:03] not at all [12:11:12] (CR) BBlack: [C: +2] Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - https://gerrit.wikimedia.org/r/501552 (owner: BBlack) [12:11:28] but it didn't page, right?
[12:11:34] the issue was that the host was in downtime but the services on it were not [12:11:38] double check puppet class :-) [12:11:43] (03PS2) 10BBlack: Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - 10https://gerrit.wikimedia.org/r/501552 [12:11:45] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - 10https://gerrit.wikimedia.org/r/501552 (owner: 10BBlack) [12:12:03] !log repool esams [12:12:03] there is an action in the icinga web ui like "this host and all services on it" [12:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] I think you may be using production configuration [12:12:16] on a non-production host [12:12:44] yea, probably "if -dev in host name then not CRIT" ? [12:13:03] could turn off the paging that way in hiera [12:14:12] icinga unhandled issues back to just 2 again now.. one of which is esams and one is known/common [12:15:09] (03Abandoned) 10Hashar: (WIP) run tests against multiple mw versions (WIP) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332777 (https://phabricator.wikimedia.org/T115713) (owner: 10Hashar) [12:15:31] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:35] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [12:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:11] (03Abandoned) 10Hashar: Fix .gitreview to point to proper repo [debs/php-excimer] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/481613 (owner: 10Hashar) [12:18:17] (03Abandoned) 10Hashar: gbp: use upstream branch master, not tags [debs/php-excimer] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/481615 (owner: 10Hashar) [12:18:27] !log gehel@cumin2001 END (ERROR) - Cookbook 
sre.elasticsearch.rolling-reboot (exit_code=97) [12:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:38] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [12:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:08] (03PS3) 10Hashar: Rake: honor rubocop AllCops/Excludes [puppet] - 10https://gerrit.wikimedia.org/r/484410 [12:19:22] (03PS5) 10Hashar: rsync: readd incoming and outgoing chmod [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) [12:19:45] (03PS5) 10Hashar: doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) [12:21:48] (03PS3) 10Hashar: shinken: add basic spec [puppet] - 10https://gerrit.wikimedia.org/r/497253 [12:23:09] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 74.52 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:23:11] (03CR) 10Ema: [C: 03+1] Shortener VCL fixups [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [12:25:02] (03CR) 10Hashar: "I have missed:" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar) [12:25:52] (03PS2) 10Hashar: contint: update sury.org gpg key for apt [puppet] - 10https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) [12:26:53] (03PS1) 10Ema: ATS: make error template directory depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501553 (https://phabricator.wikimedia.org/T219967) [12:26:55] (03PS3) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [12:29:18] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; 
selector: name=codfw,dnsdisc=zotero [12:29:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=blubberoid [12:29:20] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=mathoid [12:29:21] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=cxserver [12:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:22] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=citoid [12:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.401e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [12:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:34] !log repool codfw for all kubernetes services [12:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:51] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 54.7 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:30:08] (03CR) 10Ema: [C: 03+2] ATS: make error template directory depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501553 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:30:40] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:44] !log 
gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:26] (03PS3) 10Dzahn: contint: update sury.org gpg key for apt [puppet] - 10https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) (owner: 10Hashar) [12:31:30] (03PS1) 10Jcrespo: mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5 [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) [12:31:40] (03CR) 10Dzahn: [C: 03+2] "confirmed this is the same file as https://packages.sury.org/php/apt.gpg" [puppet] - 10https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) (owner: 10Hashar) [12:31:54] !log repool codfw for all kubernetes services T217426 [12:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:09] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=zotero [12:32:10] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid [12:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:11] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid [12:32:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver [12:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:13] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid [12:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:18] !log depool eqiad for all kubernetes services 
T217426 [12:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:17] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [12:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:20] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [12:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:56] (03PS1) 10Gehel: elasticsearch: cleanup logging during shard allocation [software/spicerack] - 10https://gerrit.wikimedia.org/r/501556 [12:39:01] (03CR) 10Hashar: "The paths in .Dockerignore are all joined with the directory that contains the Dockerfile. git does the same with .gitignore." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484547 (https://phabricator.wikimedia.org/T183546) (owner: 10Hashar) [12:41:07] (03CR) 10Jcrespo: "This is not yet implemented, but let me know what you think of the idea/option. By making it an option (despite I hate too many options), " [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [12:43:03] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:08] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [12:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:52] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:56] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [12:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:22] !log Restarting pybal on lvs1016 and lvs2003 for 496382 [12:48:24] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:53] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:13] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=99) [12:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:45] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:54:13] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 82.59 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:54:31] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) a:03Vgutierrez [12:58:29] (03PS4) 10Alex Monk: profile::cache::ssl::unified: Allow passing certs/certs_active by hiera [puppet] - 10https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) [13:03:16] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [13:03:20] (03CR) 10Gehel: [C: 04-1] icinga: fix wrong thresholds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501542 (owner: 10Mathew.onipe) [13:04:27] (03CR) 10Vgutierrez: [C: 03+1] "pcc is happy and shows a NOOP now: https://puppet-compiler.wmflabs.org/compiler1002/15608/" [puppet] - 10https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [13:04:50] (03PS1) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) [13:04:52] 10Operations, 10cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (10aborrero) 
05Open→03Resolved a:03aborrero I just put `exit 0` in `/etc/cron.hourly/keystone`. Is a cleanup cron that makes no sense while the service is unpopulated. [13:05:40] (03PS5) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:06:04] (03CR) 10jerkins-bot: [V: 04-1] cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:07:08] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [13:08:00] (03PS2) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) [13:09:26] (03PS6) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:09:33] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: sync apt with our production settings [puppet] - 10https://gerrit.wikimedia.org/r/501563 [13:09:35] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: improve build-base-images [puppet] - 10https://gerrit.wikimedia.org/r/501564 [13:09:37] (03PS1) 10Giuseppe Lavagetto: apt: remove redundant Install-Recommends [puppet] - 10https://gerrit.wikimedia.org/r/501565 [13:10:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/501563 (owner: 10Giuseppe Lavagetto) [13:11:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker::baseimages: sync apt with our production settings [puppet] - 10https://gerrit.wikimedia.org/r/501563 (owner: 10Giuseppe Lavagetto) [13:14:23] (03PS2) 10Dzahn: toollabs::bastion: remove trusty cgred code [puppet] - 10https://gerrit.wikimedia.org/r/501531 [13:16:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: sync apt with our production settings [puppet] - 10https://gerrit.wikimedia.org/r/501563 (owner: 10Giuseppe Lavagetto) [13:16:15] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: sync apt with our production settings [puppet] - 10https://gerrit.wikimedia.org/r/501563 [13:16:18] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10bmansurov) @eliza replied. Thanks! [13:17:18] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10Dzahn) 05Open→03Resolved a:03Dzahn cool, looks like we can close it here as resolved then [13:17:37] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10aborrero) [13:17:49] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10aborrero) p:05Triage→03High [13:19:37] PROBLEM - Check systemd state on cloudservices2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:20:22] (03CR) 10Marostegui: "Question now that I realised that retention: 1" [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:22:25] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [13:23:25] PROBLEM - puppet last run on cloudservices2002-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[mariadb] [13:23:46] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: improve build-base-images [puppet] - 10https://gerrit.wikimedia.org/r/501564 [13:24:19] (03CR) 10Marostegui: "I am talking from memory,but I thought we had some sort of implementation of this on WMFReplication, why not using that one?" [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:25:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: improve build-base-images [puppet] - 10https://gerrit.wikimedia.org/r/501564 (owner: 10Giuseppe Lavagetto) [13:26:55] (03CR) 10Eevans: "I think this is missing the rack information (ala `hieradata/hosts/restbase2019.yaml` and `hieradata/hosts/restbase2020.yaml` files)." [puppet] - 10https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368) (owner: 10Filippo Giunchedi) [13:28:40] (03PS2) 10Mathew.onipe: icinga: correct direction of check [puppet] - 10https://gerrit.wikimedia.org/r/501542 [13:29:36] (03CR) 10Mathew.onipe: icinga: correct direction of check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501542 (owner: 10Mathew.onipe) [13:30:07] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:30:11] (03CR) 10Jcrespo: "> Question now that I realised that retention: 1" [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:31:07] (03PS1) 10Arturo Borrero Gonzalez: labtestnet2002: spare server in stretch [puppet] - 10https://gerrit.wikimedia.org/r/501567 (https://phabricator.wikimedia.org/T220203) [13:31:20] (03PS1) 10Dzahn: add Icinga notes_url to various NRPE monitor checks, pt 3 [puppet] - 10https://gerrit.wikimedia.org/r/501568 [13:31:31] (03CR) 10Marostegui: [C: 03+1] "> > Question now that I realised that retention: 1" [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:31:34] ACKNOWLEDGEMENT - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T219933 [13:33:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2002: spare server in stretch [puppet] - 10https://gerrit.wikimedia.org/r/501567 (https://phabricator.wikimedia.org/T220203) (owner: 10Arturo Borrero Gonzalez) [13:33:45] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10Dzahn) cloudservices2002-dev is alerting in Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cg... [13:34:44] ACKNOWLEDGEMENT - Check systemd state on cloudservices2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T220101 [13:34:44] ACKNOWLEDGEMENT - puppet last run on cloudservices2002-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[mariadb] daniel_zahn https://phabricator.wikimedia.org/T220101 [13:35:14] cwd: something broke on frack puppet related to icinga [13:35:18] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:22] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] (03PS1) 10Ema: ATS: make config files depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501569 (https://phabricator.wikimedia.org/T219967) [13:35:40] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:45] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:10] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10fgiunchedi) a:05fgiunchedi→03Papaul @Papaul we need to allocate these hosts in the same rows as the hosts they are replacing (restbase2007 and restbase2008) thus p... 
[13:36:46] !log T220101 disable active icinga checks for cloudcontrol2002-dev [13:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:51] T220101: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 [13:37:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 1073 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:37:52] arturo: please dont disable active checks. can we do downtime instead ? [13:38:32] imho it should disappear from icinga when going through the rename workflow [13:38:50] (03CR) 10Jcrespo: "> I am talking from memory,but I thought we had some sort of" [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:38:53] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) @EBernhardson question for you - while working on https://gerr... [13:38:57] mutante: Jeff is fixing [13:39:15] cwd: :) [13:39:44] (03CR) 10Eevans: [C: 03+1] Add restbase2019 / restbase2020 instances [dns] - 10https://gerrit.wikimedia.org/r/501525 (https://phabricator.wikimedia.org/T217368) (owner: 10Filippo Giunchedi) [13:40:21] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` labtestnet2002.codfw.wm... 
[13:41:19] !log T220203 reimage labtestnet2002 as spare in stretch [13:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:23] T220203: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 [13:41:23] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: fix yaml syntax [puppet] - 10https://gerrit.wikimedia.org/r/501570 [13:41:52] mutante: was trying to reduce noise as much as possible. Will revisit later... or probably next week if you don't revisit before [13:42:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: fix yaml syntax [puppet] - 10https://gerrit.wikimedia.org/r/501570 (owner: 10Giuseppe Lavagetto) [13:42:30] arturo: ok, downtime also means no alerts [13:43:20] * unless the downtime happens after the alerting [13:43:36] for example, if it pages -> downtime -> the recovery will page [13:43:43] (even on downtime) [13:44:11] (03CR) 10Marostegui: "> Also WMFMariaDB is not being used by any of these scripts so we have to think if we want to use mariadb to connect (in addition to cumin" [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:44:56] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=zotero [13:44:57] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=blubberoid [13:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:58] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mathoid [13:44:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=cxserver [13:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=citoid [13:45:02] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:06] !log ρepool eqiad for all kubernetes services T217426 [13:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:11] !log repool eqiad for all kubernetes services T217426 [13:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:00] (03CR) 10Ema: [C: 03+2] ATS: make config files depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501569 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:49:08] (03PS2) 10Ema: ATS: make config files depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501569 (https://phabricator.wikimedia.org/T219967) [13:49:13] (03CR) 10Jcrespo: "> We might be creating some tech debt." [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:51:07] (03CR) 10Alexandros Kosiaris: Add an update action (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [13:53:00] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [13:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:24] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) Doesn't seem to be needed anymore, feel free to start mo... 
[13:57:45] (03CR) 10Dzahn: "still don't really understand why we want to manually edit files on doc and for the rsync options i will add Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [13:59:17] (03Abandoned) 10Elukey: admin: add gpu-users group and assign it to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [14:04:21] (03PS1) 10Elukey: admin: remove sudo permissions from gpu-testers and add users to it [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) [14:06:08] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/15611/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [14:06:27] (03PS3) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) [14:06:50] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [14:07:42] 10Operations: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (10ssastry) See logs below from scandium. Seems to reliably shut down after 1 hour of no activity. Must be some config setting in some nodejs library. ` $ sudo journalctl -n 1000 -u parsoid-vd | egrep "8011|FAIL" ....... 
[14:08:02] (03PS7) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [14:09:21] (03PS3) 10Gehel: icinga: correct direction of check [puppet] - 10https://gerrit.wikimedia.org/r/501542 (owner: 10Mathew.onipe) [14:10:14] (03CR) 10Gehel: [C: 03+2] icinga: correct direction of check [puppet] - 10https://gerrit.wikimedia.org/r/501542 (owner: 10Mathew.onipe) [14:10:41] 10Operations: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (10ssastry) @Arlolra @cscott Any insights here? Why would express terminate after an hour of being idle? (testreduce codebase for parsoid-vd service on scandium). This is not critical. I asked @Dzahn to turn off these... [14:14:07] (03PS2) 10Elukey: admin: remove sudo permissions from gpu-testers and add users to it [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) [14:14:09] (03PS1) 10Elukey: admin: create analytics-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) [14:15:38] (03PS2) 10Elukey: admin: create analytics-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) [14:16:00] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['labtestnet2002.codfw.wmnet'] ` and were **ALL** successful. 
[14:16:42] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) p:05Triage→03Normal a:03elukey [14:21:52] (03CR) 10Alexandros Kosiaris: "Aside from my comment, rest lgtm" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [14:25:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [14:26:49] (03PS1) 10Elukey: role::statistics::gpu: add common statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843) [14:27:21] (03CR) 10Alexandros Kosiaris: Add changelog (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 (owner: 10Giuseppe Lavagetto) [14:27:29] (03CR) 10jerkins-bot: [V: 04-1] role::statistics::gpu: add common statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [14:28:43] how dare you Luca using tabs [14:29:44] (03PS1) 10Alex Monk: openstack::puppet::master::encapi: work on stretch with python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/501581 (https://phabricator.wikimedia.org/T171188) [14:29:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor typo, logic LGTM" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 (owner: 10Giuseppe Lavagetto) [14:30:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Andrew) [14:30:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 (owner: 10Giuseppe Lavagetto) [14:30:15] did somebody just 
disable notifications for everything on scandium? please don't [14:31:04] (03PS2) 10Elukey: role::statistics::gpu: add common statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843) [14:32:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, but some tests for --snapshot would be nice" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 (owner: 10Giuseppe Lavagetto) [14:32:45] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15613/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [14:33:16] (03CR) 10Andrew Bogott: [C: 03+2] openstack::puppet::master::encapi: work on stretch with python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/501581 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [14:33:23] (03PS2) 10Andrew Bogott: openstack::puppet::master::encapi: work on stretch with python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/501581 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [14:39:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Well, the fact we populate it on installation should not mean that we avoid enforcing it in runtime." [puppet] - 10https://gerrit.wikimedia.org/r/501565 (owner: 10Giuseppe Lavagetto) [14:40:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond) [14:41:33] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Eevans) @Dzahn I think this was just for !!labs/private.git!!, could you do the same for p... 
[14:42:09] (03CR) 10Jbond: [C: 03+2] package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond) [14:42:40] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Dzahn) @eevans I did both, the private repo part just doesn't show up on ticket. [14:44:03] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Eevans) >>! In T219560#5088509, @Dzahn wrote: > @eevans I did both, the private repo part... [14:45:03] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Dzahn) @Eevans Yep, i noticed that too and couldn't find the puppet code that would use th... [14:46:21] (03PS1) 10Dzahn: icinga/parsoid: disable systemd monitoring on scandium test host [puppet] - 10https://gerrit.wikimedia.org/r/501586 (https://phabricator.wikimedia.org/T219933) [14:47:39] (03PS1) 10Alex Monk: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 [14:48:02] (03CR) 10Dzahn: [C: 03+2] icinga/parsoid: disable systemd monitoring on scandium test host [puppet] - 10https://gerrit.wikimedia.org/r/501586 (https://phabricator.wikimedia.org/T219933) (owner: 10Dzahn) [14:49:14] (03PS2) 10Alex Monk: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) [14:49:52] urandom: can you see the puppet code location that should use the new private variables ? 
[14:50:24] (03PS4) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) [14:51:49] mutante: there are two templates, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/cassandra/templates/adduser.cql.erb and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/cassandra/templates/cqlshrc.erb [14:53:44] (03CR) 10Subramanya Sastry: [C: 03+1] icinga/parsoid: disable systemd monitoring on scandium test host [puppet] - 10https://gerrit.wikimedia.org/r/501586 (https://phabricator.wikimedia.org/T219933) (owner: 10Dzahn) [14:54:21] (03CR) 10Jcrespo: "Comment." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [14:54:32] urandom: yes, i saw those, but something else must be missing in .pp files [14:54:54] oh. [14:55:00] hrm. [14:55:22] !log powering down restbase2019 and 2020 for relocation [14:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:26] there is definitely class passwords::cassandra with $application_username = 'seessions' [14:55:30] sessions [14:56:05] (03CR) 10CDanis: [C: 03+1] grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: 10Filippo Giunchedi) [14:56:50] PROBLEM - Host restbase2020 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:53] maybe the "include ::passwords::cassandra" [14:57:01] hmm.. [14:57:06] mutante: I'll bet it's missing from role::sessionstore [14:57:12] s/from/in/ [14:57:19] i.e. 
it ought to be there, and isn't :/ [14:57:34] PROBLEM - Host restbase2019 is DOWN: PING CRITICAL - Packet loss = 100% [14:57:39] that includes profile::cassandra though [14:57:53] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) p:05High→03Normal Lowering priority since the remaining steps are less urgent. [14:57:54] and that includes the password class.. hrmm [14:59:12] mutante: the restbase role has a `include ::passwords::cassandra`, and the sessionstore one does not [14:59:36] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: add myspell guards [puppet] - 10https://gerrit.wikimedia.org/r/501589 (https://phabricator.wikimedia.org/T148843) [15:00:13] urandom: heh, i guess that is it.. though profile::cassandra also has include ::passwords::cassandra [15:00:43] that one probably not in scope then [15:01:15] I wonder what would happen if it were removed [15:01:42] we can jupload a patch to do that and then compile it [15:01:52] ya [15:02:05] PROBLEM - Host restbase2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:02:54] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15615/" [puppet] - 10https://gerrit.wikimedia.org/r/501589 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [15:04:00] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10aborrero) Right now this server is spare, waiting for further decisions, see https://phabricator.wikimedia.org/T217891#5088306 [15:04:21] adds the include to sessionstore role [15:04:33] (03PS1) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) [15:04:59] (03CR) 10Eevans: [C: 03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [15:05:06] (03CR) 10Marostegui: [C: 04-2] db-eqiad.php: Promote db1075 to master (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [15:05:37] (03CR) 10jerkins-bot: [V: 04-1] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [15:05:45] doh [15:05:45] 10Operations, 10Patch-For-Review: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (10Dzahn) systemd monitoring has been removed from icinga entirely for scandium. that's what was flapping due to this. all other base monitoring checks are still here and enabled (again). [15:05:53] gah.. style check? [15:06:15] yep... [15:06:30] we are not supposed to include them in the role in the first place ... [15:08:13] PROBLEM - Host restbase2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:17] how is it even possible that we get alerts for restbase2020 before it is in DNS https://gerrit.wikimedia.org/r/c/operations/dns/+/501525 [15:09:22] sigh..it never stops [15:10:46] papaul: seems they are in the wrong row [15:11:10] ACKNOWLEDGEMENT - Host restbase2019 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368 [15:11:10] ACKNOWLEDGEMENT - Host restbase2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368 [15:11:10] ACKNOWLEDGEMENT - Host restbase2020 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368 [15:11:10] ACKNOWLEDGEMENT - Host restbase2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368 [15:12:27] urandom: not sure yet why the include in profile::cassandra doesnt give us what we want 
already .. [15:13:16] mutante: or why it's OK for it to be included in the restbase role... [15:13:23] does the order matter? [15:13:49] urandom: it's probably not ok.. just that jenkins check only looks for NEW violations and ignores existing ones [15:13:58] OK [15:14:13] 10Operations, 10Traffic, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki) We can gradually increase API PHP7 traffic by switching completely each server to PHP7 one or more at a time. If we assume for example that each A... [15:14:57] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [15:15:00] 10Operations, 10serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki) [15:15:04] 10Operations, 10Traffic, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki) [15:15:42] (03PS1) 10Muehlenhoff: Remove access for gtirloni [puppet] - 10https://gerrit.wikimedia.org/r/501596 (https://phabricator.wikimedia.org/T220211) [15:16:07] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [15:16:10] 10Operations, 10serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki) [15:16:13] 10Operations, 10serviceops, 10Beta-Feature: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (10jijiki) [15:16:33] PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100% [15:16:35] (03PS8) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [15:16:43] PROBLEM - Host 
poolcounter2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:19] PROBLEM - Host install2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:19] PROBLEM - Host ganeti2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:21] PROBLEM - Host pollux is DOWN: PING CRITICAL - Packet loss = 100% [15:17:21] PROBLEM - Host tureis is DOWN: PING CRITICAL - Packet loss = 100% [15:17:35] PROBLEM - Host releases2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:18:17] ugh [15:18:21] hmmm [15:18:55] we lost ganeti2004 just now? [15:19:27] urandom: for some reason it set @super_password to "cassandra" in cqlshrc on sessionstore1001 ? [15:19:34] herron: ganeti.. yea :( [15:19:42] ahh that explains it [15:19:47] will these automatically relocate to another host? [15:19:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "makes sense to me. Worth double-checking that we are not breaking the code live in production when this change is merged. Specially some m" [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:20:04] mutante: that's the default [15:20:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for gtirloni [puppet] - 10https://gerrit.wikimedia.org/r/501596 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff) [15:21:40] chaomodus: cant reach mgmt either [15:21:47] asking in dcops about ongoing work [15:22:19] 2003 isn't reachable? [15:22:39] 2004 mgmt works for me [15:22:44] (03PS1) 10Elukey: profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs [puppet] - 10https://gerrit.wikimedia.org/r/501600 (https://phabricator.wikimedia.org/T148843) [15:23:32] I’m on 2003 at the moment [15:23:35] RECOVERY - Host restbase2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.60 ms [15:24:08] ganeti2004.codfw.wmnet ? ? ? ? ? 
6 7 [15:24:12] in the node list [15:24:46] https://www.irccloud.com/pastebin/dIJXyDST/ [15:24:51] papaul is checking the cable [15:24:57] this is B5 [15:25:17] safe to attempt to restart these, or might that make things worse? [15:25:22] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10aborrero) 05Open→03Declined We are keeping labtestnet2002 as spare for now. Is running Debian already, with role::spare. [15:25:30] herron: ip link show [15:25:32] herron: don't they have to be migrated before they can start again? [15:25:33] does it still have link? [15:26:17] mutante: what do you mean “it” ? [15:26:34] (03PS2) 10Elukey: profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs [puppet] - 10https://gerrit.wikimedia.org/r/501600 (https://phabricator.wikimedia.org/T148843) [15:26:51] herron: ganeti2004 or ganeti2003 [15:27:01] you mentioned both [15:27:16] I can’t reach ganeti2004, but I’m logged in to ganeti2003 [15:27:17] RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 37.24 ms [15:27:17] RECOVERY - Host install2002 is UP: PING OK - Packet loss = 0%, RTA = 37.83 ms [15:27:19] RECOVERY - Host releases2001 is UP: PING OK - Packet loss = 0%, RTA = 38.27 ms [15:27:23] RECOVERY - Host ganeti2004 is UP: PING OK - Packet loss = 0%, RTA = 73.08 ms [15:27:25] RECOVERY - Host poolcounter2002 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms [15:27:25] RECOVERY - Host pollux is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [15:27:25] there we go [15:27:31] RECOVERY - Host tureis is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms [15:27:35] fwew [15:27:39] cable issue? 
[15:27:44] it was the cable on 2004 [15:27:48] -dcops [15:27:55] at least it was nothing serious :) [15:28:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10RobH) Does this new group have sudo? If they are becoming 'analytics' users, it sounds like it will? (I don't see sudo rights i... [15:28:06] (03CR) 10Elukey: [C: 03+2] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs [puppet] - 10https://gerrit.wikimedia.org/r/501600 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [15:28:56] (03PS3) 10Bstorm: postgresql: set recovery.conf to writeable by postgres user [puppet] - 10https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652) [15:30:13] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:13] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:13] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) >>! In T220175#5088605, @RobH wrote: > Does this new group have sudo? If they are becoming 'analytics' users, it sounds... [15:30:16] I think the answer to my question is trying to restart them would have made it worse [15:30:21] (03CR) 10Bstorm: [C: 03+2] postgresql: set recovery.conf to writeable by postgres user [puppet] - 10https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) [15:30:35] since the cluster was probably split brain until the cable was reconnected? [15:30:49] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:31:17] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:31:17] maybe it wouldn’t have allowed that action though. anyway glad it's back to normal [15:31:22] herron: should probably come up with a strategy to deal with that, like the wikitech page is about when you know it's gonna go down - you drain the node and stuff [15:32:31] PROBLEM - puppet last run on poolcounter2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:32:40] ahah [15:32:48] https://wikitech.wikimedia.org/wiki/Service_restarts#Ganeti [15:32:57] PROBLEM - puppet last run on tureis is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:32:58] for planned maintenance on hosts there are docs [15:33:00] more evidence that this problem is due to load on puppetmaster [15:33:28] cdanis: yes, i mean, if there's an /unexpected/ outage of a node [15:33:36] cdanis: there should be a strategy [15:33:45] but also the vms on ganeti2004 were without network for a bit [15:33:50] re: puppet fails [15:34:21] it's going to be just that ... 
running puppet [15:34:26] right so they are all running simultaneously [15:34:36] and loading puppetmaster in some way that causes the catalog fetch fail [15:34:53] chaomodus: ah, yeah [15:34:58] oh I guess it could be, I was thinking more along the lines of the scheduled cron failed due to cable issue and now icinga noticed [15:35:16] nah, it's fine on poolcounter2002 but had to manually run it [15:35:29] i was thinking they were blocking on connect so they all were able to connect at the same time [15:35:33] i don't think they were running at the same time. it's randomized [15:35:33] but anyways [15:35:35] https://nsrc.org/workshops/2016/sanog27/raw-attachment/wiki/Track2Virt/ex-ganeti-failure-scenarios.htm [15:37:47] RECOVERY - puppet last run on poolcounter2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:38:13] RECOVERY - puppet last run on tureis is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:39] I think these puppet fails are a different condition from our suspected load related 503s. 
the error was poolcounter2002 puppet-agent[13231]: Could not retrieve catalog from remote server: getaddrinfo: Name or service not known [15:39:54] interesting [15:41:02] good point [15:41:07] RECOVERY - Host restbase2020.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 345.43 ms [15:41:56] * Krinkle staging on mwdebug1002 to roll out three patches for wmf.24 (group0 only) [15:45:21] (03PS2) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) [15:45:38] herron: i think that had nothing to do with master overload [15:45:43] it was just the network being down [15:46:01] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:46:12] (03PS1) 10Elukey: ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) [15:46:17] (03CR) 10jerkins-bot: [V: 04-1] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [15:47:28] (03PS3) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) [15:49:44] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Papaul) a:05Papaul→03Marostegui complete [15:50:08] (03PS1) 10Muehlenhoff: Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) [15:50:12] (03CR) 10Eevans: [C: 03+1] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [15:51:19] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:51:45] (03PS1) 
10BBlack: wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610 [15:51:48] (03PS1) 10BBlack: wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611 [15:51:49] (03PS1) 10BBlack: wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612 [15:51:51] (03PS1) 10BBlack: wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613 [15:51:52] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/GlobalBlocking/includes/specials/: I5843cd181ca7d (duration: 01m 02s) [15:51:54] (03PS1) 10BBlack: ns[012].wikimedia.org: 1D TTLs [dns] - 10https://gerrit.wikimedia.org/r/501614 [15:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:56] (03PS1) 10BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 [15:52:08] (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611 (owner: 10BBlack) [15:52:10] (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610 (owner: 10BBlack) [15:52:13] (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612 (owner: 10BBlack) [15:52:19] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:22] (03CR) 10Dzahn: [C: 03+1] Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff) [15:52:24] (03CR) 10jerkins-bot: [V: 04-1] wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 (owner: 10BBlack) [15:52:29] (03CR) 10jerkins-bot: [V: 04-1] ns[012].wikimedia.org: 1D TTLs [dns] - 
10https://gerrit.wikimedia.org/r/501614 (owner: 10BBlack) [15:52:31] (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613 (owner: 10BBlack) [15:52:58] urandom: "Compilation results for sessionstore1001.eqiad.wmnet: no change" :( [15:53:02] wtf [15:53:13] merges it anyways to see :) [15:53:29] (03CR) 10Muehlenhoff: ores::base: fix package requires for Debian Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [15:53:31] (03CR) 10Dzahn: [C: 03+2] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [15:53:44] (03PS1) 10Jbond: aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) [15:53:46] (03PS4) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) [15:53:48] (03PS1) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [15:53:50] (03PS1) 10Jbond: facter3 and puppet5: add repositories for puppet5 and facter3 [puppet] - 10https://gerrit.wikimedia.org/r/501618 (https://phabricator.wikimedia.org/T219803) [15:54:48] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:54:57] (03CR) 10Elukey: ores::base: fix package requires for Debian Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [15:55:15] (03CR) 10jerkins-bot: [V: 04-1] puppet: Refactor of the base::puppet class [puppet] - 
10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:55:18] (03CR) 10Jbond: [C: 03+2] aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:55:29] (03PS2) 10Jbond: aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) [15:55:31] (03CR) 10jerkins-bot: [V: 04-1] facter3 and puppet5: add repositories for puppet5 and facter3 [puppet] - 10https://gerrit.wikimedia.org/r/501618 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:56:37] (03PS2) 10Muehlenhoff: Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) [15:57:07] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:57:27] (03PS3) 10Jbond: aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) [15:57:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:57:55] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/NavigationTiming/: I6b23be850d35c7d19 / T220156 (duration: 01m 00s) [15:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:58] T220156: navtiming: firstPaint.mobile metric broken on wmf.24 - https://phabricator.wikimedia.org/T220156 [15:58:17] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) - restbase2019 relocate to B5 restbase2020 relocate to C5 - Netbox update - clean asw-a-codfw [15:58:28] 
urandom: merged and ran puppet and nothing happened. it must be something else we dont see yet :( [15:58:31] (03PS3) 10Muehlenhoff: Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) [15:58:33] not that include [15:58:45] wth [15:58:46] (03PS2) 10BBlack: wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610 [15:58:48] (03PS2) 10BBlack: wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611 [15:58:49] mutante: OK [15:58:49] (03PS2) 10BBlack: wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612 [15:58:51] (03PS2) 10BBlack: wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613 [15:58:54] (03PS2) 10BBlack: ns[012].wikimedia.org: 1D TTLs [dns] - 10https://gerrit.wikimedia.org/r/501614 [15:58:56] (03PS2) 10BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 [15:59:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Cmjohnson) [15:59:36] (03CR) 10jerkins-bot: [V: 04-1] wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 (owner: 10BBlack) [15:59:40] (03PS3) 10Jbond: package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) [15:59:48] (03CR) 10Dzahn: [C: 03+2] Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff) [15:59:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff) [16:00:37] (03PS4) 
10Jbond: package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) [16:00:56] urandom: at least then we dont have to wonder why the include in the profile doesnt work.. trying to see the silver lining. heh. but also need to run soon [16:01:25] (03PS2) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [16:01:29] mutante: I understand [16:01:42] mutante: for another day then; thanks! [16:01:49] right on Monday! [16:02:11] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/includes/jobqueue/jobs/RefreshLinksJob.php: Ib1ac31365f9c / T220037 (duration: 00m 59s) [16:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:15] T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 [16:02:28] (03CR) 10jerkins-bot: [V: 04-1] puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [16:02:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Cmjohnson) a:05Cmjohnson→03RobH Assigning back to robh NIC has been enabled to PXE, second cable has been run, por... 
[16:02:45] (03PS1) 10Dzahn: Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620 [16:03:18] 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Cmjohnson) [16:03:47] 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Cmjohnson) a:05Cmjohnson→03Andrew This server should be ready to go [16:04:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Cmjohnson) [16:04:56] (03PS3) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [16:05:22] (03PS2) 10Dzahn: Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620 [16:05:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Cmjohnson) a:05Cmjohnson→03RobH @robh connected the second cable, updated switch cfg with cloud-virt-instance-trunk. 
Ran the SPP [16:05:58] (03PS3) 10Dzahn: Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620 [16:06:08] (03CR) 10Dzahn: [C: 03+2] Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620 (owner: 10Dzahn) [16:06:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10Cmjohnson) [16:07:09] (03PS3) 10BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 [16:08:05] (03PS2) 10Elukey: ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) [16:08:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10Cmjohnson) Everything is done with this server but I am not getting any link lights on the 10G card. I verified and re-verified that the car... [16:09:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Cmjohnson) [16:09:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Cmjohnson) Everything is done with this server but I am not getting any link lights on the 10G card. I verified and re-verified that the car... 
[16:09:48] (03CR) 10Elukey: "Mortiz: not sure if this version is correct, because on the same pre-existing stretch host we'd end up with, for example, both hunspell-ca" [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:09:50] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10Papaul) papaul@asw-c-codfw# run show interfaces ge-1/0/17 descriptions Interface Admin Link Descript... [16:10:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10Cmjohnson) [16:10:14] I'm trying to push a patch to gerrit and I keep getting a disconnect issue with "Too many authentication failures: 7" message. [16:10:19] Any idea what's going on? [16:10:20] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10Papaul) [16:10:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10Cmjohnson) a:05Cmjohnson→03Andrew The server is moved and is ready to install [16:10:55] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10Papaul) a:05Papaul→03aborrero [16:12:40] (03PS2) 10Jbond: facter3 and puppet5: add repositories for puppet5 and facter3 [puppet] - 10https://gerrit.wikimedia.org/r/501618 (https://phabricator.wikimedia.org/T219803) [16:14:46] 10Operations, 10wikidiff2, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set 
WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10Krinkle) p:05Triage→03High [16:15:44] dmaza: do you know when it stopped working for you? [16:16:03] it worked fine 2 days ago [16:16:23] actually, it worked fine yesterday [16:16:24] and you can log into the web ui fine? [16:16:28] yes [16:16:50] checked ssh pubkey and all that? [16:17:18] yup.. Lemme double check it is the same but nothing has changed on my system [16:18:00] dmaza: the ssh authentication fails due to the ssh key not matching [16:18:39] found via: ssh cobalt.wikimedia.org grep dmaza /var/log/gerrit/sshd_log [16:18:53] let me re-add the keys.. that's very odd [16:19:04] you can check at https://gerrit.wikimedia.org/r/#/settings/ssh-keys [16:19:05] yeah, I see it working yesterday in those logs as well [16:19:14] and also verify your local ssh config to make sure it offers the proper key [16:19:45] (03PS1) 10Elukey: Fix more common packages deployed to Buster based Analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/501621 (https://phabricator.wikimedia.org/T148843) [16:20:08] if it is a 'client not offering the correct key' issue, then this might be helpful dmaza: ssh -v -p 29418 gerrit.wikimedia.org [16:20:26] welp.. I restarted sshd and re-added my key and it works now [16:21:00] (03CR) 10Elukey: [C: 03+2] Fix more common packages deployed to Buster based Analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/501621 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:21:07] thank you and sorry for the inconvenience [16:21:33] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Marostegui) Thanks, it is rebuilding ` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 54% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port... [16:21:34] dmaza: and your key have not been changed in gerrit :) [16:21:40] dmaza: well done! 
[16:22:17] cdanis: so for Gerrit, I usually jump to that cobalt.wikimedia.org:/var/log/gerrit/sshd_log and usually that gives some good enough clues (e.g. auth failure) [16:22:23] haha.. I assumed that someone was brute-forcing my account 🤷‍♂️ [16:22:27] sometimes that is just using the wrong username [16:22:42] since ssh would typically use the local username which might not match the WMCS shell name [16:22:46] 10Operations, 10wikidiff2, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10WMDE-Fisch) We changed the signature in wikidiff2 version 1.8.0 so not using $wikiDiff2MovedParagraphDetectionCuto... [16:22:54] yeah, makes sense [16:23:17] right. I wonder if git review has a "verbose" option [16:24:22] oh it does (-v). Maybe I'll try that if I have any other issues. It might spit out some useful info [16:24:52] (03PS4) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [16:25:32] 10Operations, 10wikidiff2, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10MoritzMuehlenhoff) @Krinkle, @WMDE-Fisch : Shall we depool the five servers already upgraded until that is resolved? 
[16:27:51] (03CR) 10Jbond: "catalogue compile output" [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [16:30:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:30:36] 10Operations, 10wikidiff2, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10Krinkle) @WMDE-Fisch Does this mean the feature is no longer exists, or is no longer configu... [16:30:54] (03CR) 10CRusnov: "Just a quick look through. I'm no puppet expert but overall seems good moving toward the standard modren way things are done and not the o" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [16:30:57] (03CR) 10Muehlenhoff: "The packages should probably audited one by one on a stretch system, e.g. myspell-ca is a transitional package in stretch and a virtual pa" [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:33:41] 10Operations, 10wikidiff2, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10WMDE-Fisch) >>! In T220217#5088833, @Krinkle wrote: > @WMDE-Fisch Does this mean the feature... [16:33:43] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10CDanis) >>! In T219803#5088004, @jbond wrote: > A bit more to the picture, managed to get facter to build by updating all refrence of `std::unordered_map` to `s... 
[16:34:43] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) ` papaul@asw-b-codfw> show interfaces ge-5/0/18 descriptions Interface Admin Link Description ge-5/0/18 up down restbase2019 papaul@asw-c-codf... [16:36:18] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) Ran into this in {... [16:37:26] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) @CDanis thanks i will try a patch with some of the other maps however the problem is that the std::unsorted_map is available but it has a bug in the libr... [16:38:21] (03PS1) 10BBlack: Add CNAME-variant langlist template [dns] - 10https://gerrit.wikimedia.org/r/501628 (https://phabricator.wikimedia.org/T208263) [16:38:23] (03PS1) 10BBlack: wiktionary: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/501629 (https://phabricator.wikimedia.org/T208263) [16:39:42] (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [16:40:11] (03PS1) 10Andrew Bogott: Toolforge: update indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630 [16:42:18] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10CDanis) Ah, got it. Sorry for not reading more of the context here, just saw that one line and thought "uh oh" :) [16:43:19] (03CR) 10Muehlenhoff: "Maybe sync that up with mediawiki::packages::fonts, e.g. the Malayalam fonts are missing in your list." 
[puppet] - 10https://gerrit.wikimedia.org/r/501630 (owner: 10Andrew Bogott) [16:43:31] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Krinkle) [16:44:13] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Krinkle) (Adding to our radar to look at navtiming/dns metrics impact after it is rolled out.) [16:45:25] (03PS1) 10Elukey: Fix more common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501632 (https://phabricator.wikimedia.org/T148843) [16:46:16] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) >>! In T219803#5088899, @CDanis wrote: > Ah, got it. Sorry for not reading more of the context here, just saw that > one line and thought "u... [16:47:23] (03CR) 10Elukey: [C: 03+2] Fix more common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501632 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:47:31] (03CR) 10Hashar: "Thank you :]" [puppet] - 10https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) (owner: 10Hashar) [16:47:54] (03CR) 10Nuria: [C: 03+1] admin: remove sudo permissions from gpu-testers and add users to it [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:48:52] (03CR) 10Nuria: [C: 03+1] "Nice, yes , agreed this is much better." 
[puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) (owner: 10Elukey) [16:49:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) We may try the wiktionary patch early next week. The goal with that test is just to see if we get any... [16:50:03] (03PS2) 10Andrew Bogott: Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630 [16:53:46] (03PS1) 10Elukey: Fix last common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501635 (https://phabricator.wikimedia.org/T148843) [16:54:33] (03PS3) 10Andrew Bogott: Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630 [16:54:58] (03CR) 10Elukey: [C: 03+2] Fix last common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501635 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:55:54] (03PS4) 10Andrew Bogott: Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630 [16:57:12] (03CR) 10Andrew Bogott: [C: 03+2] Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630 (owner: 10Andrew Bogott) [16:59:09] (03PS1) 10Alexandros Kosiaris: waf: Add dummy data for it [labs/private] - 10https://gerrit.wikimedia.org/r/501636 [16:59:13] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) The long list of patches above was needed to allow to deploy t... 
[17:01:32] RECOVERY - HP RAID on db2044 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [17:10:45] (03PS1) 10Thcipriani: Revert "Gerrit 2.15.12 (update core only)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/501638 [17:12:26] RECOVERY - Device not healthy -SMART- on db2044 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops [17:12:32] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/includes/diff/TextSlotDiffRenderer.php: Ia326c67de28a4e / T220217 (duration: 01m 00s) [17:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:36] T220217: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 [17:12:37] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Revert "Gerrit 2.15.12 (update core only)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/501638 (owner: 10Thcipriani) [17:12:50] (03PS1) 10Alexandros Kosiaris: waf: Move to httpd::conf instead of httpd::site [puppet] - 10https://gerrit.wikimedia.org/r/501639 [17:12:52] (03PS1) 10Alexandros Kosiaris: waf: Remove realm if guards [puppet] - 10https://gerrit.wikimedia.org/r/501640 [17:16:47] 10Operations, 10Analytics, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10ayounsi) a:03ayounsi [17:17:54] (03CR) 10Herron: [C: 03+1] waf: Add dummy data for it [labs/private] - 10https://gerrit.wikimedia.org/r/501636 (owner: 10Alexandros Kosiaris) [17:18:02] 10Operations, 10Release Pipeline, 10Core Platform Team Kanban (Done with CPT), 10Release-Engineering-Team (Watching / External), 10Services (done): Track and install additional npm packages for all service container images - 
https://phabricator.wikimedia.org/T205911 (10mobrovac) FYI, [service-runner v2.6... [17:19:35] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.23/includes/diff/TextSlotDiffRenderer.php: Ia326c67de28a4e / T220217 (duration: 01m 02s) [17:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:39] T220217: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 [17:21:21] 10Operations, 10wikidiff2, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10Krinkle) 05Open→03Resolved a:03WMDE-Fisch [17:23:32] Krinkle: are you in the middle of some backports? I was just about to do a quick gerrit restart. [17:23:39] thcipriani: done [17:23:47] k, thanks [17:24:09] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] waf: Add dummy data for it [labs/private] - 10https://gerrit.wikimedia.org/r/501636 (owner: 10Alexandros Kosiaris) [17:25:33] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@a4e66d4]: Gerrit to back to 2.15.11 (on gerrit2001 only) [17:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [17:25:43] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@a4e66d4]: Gerrit to back to 2.15.11 (on gerrit2001 only) (duration: 00m 10s) [17:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:28] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@a4e66d4]: Gerrit to back to 2.15.11 on cobalt (restart incoming) [17:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:40] !log thcipriani@deploy1001 Finished 
deploy [gerrit/gerrit@a4e66d4]: Gerrit to back to 2.15.11 on cobalt (restart incoming) (duration: 00m 11s) [17:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:28] !log restart gerrit [17:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:13] !log gerrit back on 2.15.11 [17:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:21] (03CR) 10Herron: [C: 03+1] waf: Remove realm if guards [puppet] - 10https://gerrit.wikimedia.org/r/501640 (owner: 10Alexandros Kosiaris) [17:31:41] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [17:33:47] (03CR) 10Herron: [C: 03+1] "looks good to me, just one nitpick" [puppet] - 10https://gerrit.wikimedia.org/r/501639 (owner: 10Alexandros Kosiaris) [17:34:16] (03CR) 10Mobrovac: [C: 03+1] "Nice catch, Ema!" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [17:35:41] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [17:47:29] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:49:02] that is quite a lot [17:52:22] * greg-g looks at the week: https://grafana.wikimedia.org/d/000000438/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen&from=now-7d&to=now [17:53:31] yeah, you are right greg-g [17:55:41] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:55:58] cdanis: sadly, of course :/ [17:56:38] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Marostegui) 05Open→03Resolved Finished correctly, thanks! ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I P... [18:01:25] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:02:35] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:04:11] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10mobrovac) >>! In T219923#5086476, @Pchelolo wrote: > Apparently `g... 
[18:12:34] (03CR) 10Aaron Schulz: [C: 03+1] db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [18:16:51] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [18:24:06] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:27:15] (03PS2) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) [18:51:33] (03PS2) 10Jcrespo: mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5 [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) [18:52:42] (03CR) 10Jcrespo: [C: 03+2] mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5 [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [18:53:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10RobH) I'll describe the issue I'm seeing, and how I've troubleshot it, so far to no avail: cloudvirt1008 will PXE boo... 
[18:54:27] (03PS1) 10RobH: cloudvirt1008 dhcp lease file correction [puppet] - 10https://gerrit.wikimedia.org/r/501666 (https://phabricator.wikimedia.org/T216661) [18:55:58] (03CR) 10RobH: [C: 03+2] cloudvirt1008 dhcp lease file correction [puppet] - 10https://gerrit.wikimedia.org/r/501666 (https://phabricator.wikimedia.org/T216661) (owner: 10RobH) [18:56:06] (03PS2) 10RobH: cloudvirt1008 dhcp lease file correction [puppet] - 10https://gerrit.wikimedia.org/r/501666 (https://phabricator.wikimedia.org/T216661) [19:02:36] (03CR) 10Bstorm: "That looks better. I can deal with an extra line break if things line up right." [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [19:07:04] (03PS3) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) [19:09:25] (03CR) 10Bstorm: [C: 03+2] labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [19:13:45] (03PS6) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [19:13:47] (03PS4) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [19:18:58] (03CR) 10Bstorm: "Confirmed it works! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [19:26:25] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:27:35] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:29:55] (03CR) 10Cwhite: [C: 03+2] grafana: update swift dashboard to use new metric names [puppet] - 10https://gerrit.wikimedia.org/r/501399 (https://phabricator.wikimedia.org/T219825) (owner: 10Cwhite) [19:30:02] (03PS2) 10Cwhite: grafana: update swift dashboard to use new metric names [puppet] - 10https://gerrit.wikimedia.org/r/501399 (https://phabricator.wikimedia.org/T219825) [19:32:08] (03CR) 10Cwhite: [C: 03+1] grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: 10Filippo Giunchedi) [19:36:41] (03PS1) 10Bstorm: Revert "labstore: Adapt nfs-exportd to be used on more than one cluster" [puppet] - 10https://gerrit.wikimedia.org/r/501676 [19:38:01] (03CR) 10Bstorm: [C: 03+2] Revert "labstore: Adapt nfs-exportd to be used on more than one cluster" [puppet] - 10https://gerrit.wikimedia.org/r/501676 (owner: 10Bstorm) [19:38:11] (03PS2) 10Bstorm: Revert "labstore: Adapt nfs-exportd to be used on more than one cluster" [puppet] - 10https://gerrit.wikimedia.org/r/501676 [19:38:39] (03CR) 10BryanDavis: "> I don't recall grid exec nodes having public IPs (in terms of the" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [19:44:12] (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [19:44:33] (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 
(https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [19:53:12] (03PS1) 10Papaul: DNS: Change production DNS for restbase2019 and restbase2020 [dns] - 10https://gerrit.wikimedia.org/r/501686 [19:53:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Andrew) a:05RobH→03Andrew [20:04:47] PROBLEM - nova-compute proc minimum on cloudvirt1012 is CRITICAL: connect to address 10.64.20.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:09] (03PS1) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [20:06:30] * arturo looking ^^^ [20:22:54] (03PS1) 10Andrew Bogott: cloudvirts: update nic names for 10Gb [puppet] - 10https://gerrit.wikimedia.org/r/501712 (https://phabricator.wikimedia.org/T216195) [20:23:50] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: update nic names for 10Gb [puppet] - 10https://gerrit.wikimedia.org/r/501712 (https://phabricator.wikimedia.org/T216195) (owner: 10Andrew Bogott) [20:24:04] (03PS2) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [20:34:19] (03PS3) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [20:35:53] (03PS4) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [20:47:02] (03PS5) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [20:47:48] (03PS6) 10Bstorm: 
labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [20:50:16] (03PS2) 10BryanDavis: dynamicproxy: Prevent STS header from non-TLS connections [puppet] - 10https://gerrit.wikimedia.org/r/499669 (https://phabricator.wikimedia.org/T102367) [20:52:46] 10Operations, 10serviceops, 10Beta-Feature: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (10Jdforrester-WMF) > Alternatively, we could decide to migrate all logged-in users before all the other users. Do we want to just do this? The only downside I can see is that content exclusively... [20:55:48] (03PS1) 10Andrew Bogott: site.pp: Make cloudvirt1008 a cloudvirt host [puppet] - 10https://gerrit.wikimedia.org/r/501791 [20:57:56] (03PS7) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [20:58:53] (03PS8) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [21:02:45] Anyone else run into ferm trying to run this ip6tables command resulting in that error? https://phabricator.wikimedia.org/P8355 [21:02:46] or know what it means [21:03:41] (03CR) 10BryanDavis: "> Did the previous/old setup work? 
Why is there a need for a change" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [21:10:43] (03PS9) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [21:12:44] (03PS10) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [21:13:39] 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Andrew) [21:13:59] 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Andrew) I reimaged and built a canary VM and everything looks good. Will put into proper service soon. [21:16:55] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:17:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10Andrew) I reimaged and built a canary VM -- the hosted VM cannot access any external networks. I haven't investigated this more deeply yet,... [21:18:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Andrew) I reimaged and built a canary VM -- the hosted VM cannot access any external networks. I haven't investigated this more deeply yet,... [21:18:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Andrew) [21:27:01] (03CR) 10Bstorm: "Ok. This is confirmed working via cherry-pick into toolsbeta now. 
It already worked fine on NFS servers, but this version also doesn't b" [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:29:59] !log CI / Zuul is no more processing events / T220243 [21:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:03] T220243: CI / Zuul is no more processing events - https://phabricator.wikimedia.org/T220243 [21:37:19] !log restarting gerrit [21:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:33] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] [21:41:57] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] [21:42:35] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:42:35] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:42:49] (03CR) 10Legoktm: "Nice. Main thing is that this should probably be built with python3 and not python2" (033 comments) [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/500201 (owner: 10MarkAHershberger) [21:45:48] !log thcipriani restarted Gerrit. 
CI works again # T220243 [21:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:52] T220243: CI / Zuul is no more processing events - https://phabricator.wikimedia.org/T220243 [21:46:51] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:47:13] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:47:51] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:48:07] PROBLEM - High lag on wdqs1003 is CRITICAL: 3657 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:53:58] (03CR) 10Alex Monk: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:54:03] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:55:24] (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:56:29] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:56:37] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:57:11] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:57:43] (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:58:07] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [21:59:06] (03CR) 10Alex Monk: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:59:07] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:00:06] (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:00:19] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:04:09] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:07:05] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:08:15] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:08:45] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 9 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10Addshore) We should see changes in 1.33.0-wmf.24. It looks like the tr... [22:09:09] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:09:11] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:09:17] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:10:19] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:10:21] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:13:13] (03PS11) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [22:13:31] (03CR) 10Addshore: [C: 03+2] "That is very much my bad, apparently that was missed from this process." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [22:13:49] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:15:35] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:17:23] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:17:33] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:17:41] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:18:02] (03CR) 10Bstorm: "I figured out why it worked before. It's that mtime bit. The variables are class-level, so they weren't overwritten just yet. 
Testing b" [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:18:04] (03CR) 10Addshore: [C: 03+1] WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [22:18:17] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:18:39] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:18:39] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:18:59] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:20:43] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:22:43] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:22:49] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:24:05] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:26:01] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: 
/analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:26:23] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:26:31] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:26:32] (03CR) 10Bstorm: "Took my test a step further. I locally changed the file on the toolsbeta puppetmaster, validated that the client abides by the changes an" [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:27:15] (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:27:21] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:27:47] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:27:59] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers 
druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:28:31] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:29:05] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:29:09] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 6.128 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:29:24] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt) [22:29:35] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:30:08] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10debt) [22:30:10] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate mjolnir to stdout/syslog/cee logging output - https://phabricator.wikimedia.org/T218833 (10debt) 05Open→03Resolved [22:30:21] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:33:47] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:34:59] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:35:01] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:35:17] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.206 second response time https://phabricator.wikimedia.org/T174916 [22:36:08] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt) [22:36:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:37:35] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:38:58] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Make spicerack more robust when unfreezing writes to elasticsearch / cirrus - https://phabricator.wikimedia.org/T219640 (10debt) 05Open→03Resolved [22:39:19] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:39:55] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused https://phabricator.wikimedia.org/T174916 [22:39:58] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Create cookbook to reset frozen write state on elasticsearch / cirrus - https://phabricator.wikimedia.org/T219638 (10debt) 05Open→03Resolved [22:40:11] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for 
english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:40:37] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.370 second response time https://phabricator.wikimedia.org/T174916 [22:40:47] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:41:11] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [22:41:33] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 474.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:41:51] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 482.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:42:01] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:42:07] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:42:30] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt) [22:42:47] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: 
/analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:42:51] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:42:51] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:43:15] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:43:19] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:43:19] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 4.116 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:43:55] 10Operations, 10CirrusSearch, 10Wikidata, 10Discovery-Search (Current work): Elasticsearch indices went read-only causing huge lag - https://phabricator.wikimedia.org/T219364 (10debt) 05Open→03Resolved [22:43:57] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:44:01] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:44:01] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:44:05] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:44:31] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:45:39] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [22:46:11] !log restarted pdfrender on scb1002 T174916 [22:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:15] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [22:47:02] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt) [22:47:04] 10Operations, 10CirrusSearch, 10Discovery-Search (Current work): Elasticsearch 6: silence deprecation warnings to avoid logspam - https://phabricator.wikimedia.org/T219269 (10debt) 05Open→03Resolved [22:47:19] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:47:53] 10Operations, 10Discovery, 10Discovery-Search (Current work), 10Patch-For-Review: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10debt) 05Open→03Resolved [22:48:03] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:49:15] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: 
/analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:49:17] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:49:40] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Convert check_elasticsearch.py icinga plugin to py3 - https://phabricator.wikimedia.org/T215439 (10debt) 05Open→03Resolved [22:49:47] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:52:25] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: update elasticsearch curator to 5.6.0 - https://phabricator.wikimedia.org/T218991 (10debt) 05Open→03Resolved [22:53:09] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:53:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10RobH) a:05Andrew→03Cmjohnson So, per @andrew's request I've investigated the switch stack software for the secondary 'instance' connecti... 
[22:53:29] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review, 10User-fgiunchedi: cleanup reprepro configuration for elasticsearch-curator - https://phabricator.wikimedia.org/T216235 (10debt) 05Open→03Resolved [22:53:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10RobH) a:05Andrew→03Cmjohnson So, per @andrew's request I've investigated the switch stack software for the secondary 'instance' connecti... [22:53:47] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:54:33] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:55:49] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:56:01] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:56:19] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:56:23] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:57:03] RECOVERY - aqs endpoints health on aqs1005 
is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:57:05] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:57:07] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [22:58:23] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [23:00:11] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:01:03] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [23:01:21] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:02:11] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [23:03:29] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:04:03] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:07:07] (03CR) 10Jforrester: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) 
(owner: 10WMDE-Fisch) [23:07:43] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:10:43] !log revert some recent problematic gerrit acl changes [23:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:23] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:43:29] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 79611 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:48:57] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 11.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [23:49:55] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [23:55:11] (03CR) 10Krinkle: [C: 03+1] Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles)