[00:01:35] <wikibugs>	 (CR) jenkins-bot: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) (owner: BryanDavis)
[00:01:37] <wikibugs>	 (CR) jenkins-bot: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[00:07:24] <logmsgbot>	 !log bd808@deploy1001 Synchronized wmf-config/wikitech.php: SWAT: [[gerrit:497866|wikitech: Lock LDAP accounts when users are blocked]], [[gerrit:501123|Disable Phabricator accounts when blocked on wikitech]] (T168692) (duration: 00m 59s)
[00:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:29] <stashbot>	 T168692: Blocking an account on wikitech should disable LDAP logins - https://phabricator.wikimedia.org/T168692
[00:09:23] <logmsgbot>	 !log bd808@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:497866|wikitech: Lock LDAP accounts when users are blocked]], [[gerrit:501123|Disable Phabricator accounts when blocked on wikitech]] (T168692) 2/2 (duration: 00m 57s)
[00:09:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:17] <wikibugs>	 (PS1) Bstorm: osmdb: set old osmdb servers to spare for decom [puppet] - https://gerrit.wikimedia.org/r/501457 (https://phabricator.wikimedia.org/T220144)
[00:11:03] <wikibugs>	 Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (Bstorm)
[00:11:45] <wikibugs>	 (CR) Bstorm: [C: +2] osmdb: set old osmdb servers to spare for decom [puppet] - https://gerrit.wikimedia.org/r/501457 (https://phabricator.wikimedia.org/T220144) (owner: Bstorm)
[00:18:28] <wikibugs>	 (PS20) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[00:19:12] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:19:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:19:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:21:55] <wikibugs>	 Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (Bstorm)
[00:24:44] <wikibugs>	 (PS21) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[00:26:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[00:29:23] <wikibugs>	 (PS22) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[00:29:54] <wikibugs>	 Operations, Puppet, puppet-compiler, Release-Engineering-Team (Watching / External): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (ayounsi) That's useful, thank you.  It didn't work for https://gerrit.wikimedia.org/r/c/operations/puppet/+/397...
[00:31:18] <wikibugs>	 (PS6) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[00:32:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi)
[00:35:02] <wikibugs>	 (PS7) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[00:35:20] <wikibugs>	 (PS8) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[00:38:35] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi)
[00:40:24] <wikibugs>	 Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (Bstorm)
[00:54:19] <icinga-wm>	 PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:49] <icinga-wm>	 RECOVERY - puppet last run on wtp1037 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:12:30] <wikibugs>	 (PS3) Alex Monk: profile::cache::ssl::unified: Allow passing certs/certs_active by hiera [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927)
[01:15:21] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:16:37] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:22:25] <wikibugs>	 (PS7) Alex Monk: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - https://gerrit.wikimedia.org/r/498655
[01:27:43] <wikibugs>	 (PS1) Alex Monk: wikiba.se TLS: Make support for different certificate sources clearer [puppet] - https://gerrit.wikimedia.org/r/501461
[01:28:16] <wikibugs>	 (CR) Alex Monk: [C: +1] ssl::wikibase: Fix le_subjects hieradata key name [puppet] - https://gerrit.wikimedia.org/r/501357 (owner: Vgutierrez)
[01:41:53] <icinga-wm>	 RECOVERY - Check systemd state on cloudcontrol2001-dev is OK: OK - running: The system is fully operational
[01:45:47] <icinga-wm>	 PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:00:05] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:07:45] <icinga-wm>	 PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:07:58] <wikibugs>	 (PS1) Mathew.onipe: icinga: Ok when total shards is zero [puppet] - https://gerrit.wikimedia.org/r/501462 (https://phabricator.wikimedia.org/T214921)
[02:17:39] <icinga-wm>	 PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:29] <icinga-wm>	 PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: k8s-etcd,prometheus class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:27:05] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[02:27:13] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:28:13] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:32:03] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:32:21] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:34:09] <icinga-wm>	 RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[02:44:05] <icinga-wm>	 RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:13:09] <icinga-wm>	 PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:21:19] <icinga-wm>	 PROBLEM - Apache HTTP on mw1314 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 5.707 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:21:33] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:22:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:22:49] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:44:51] <icinga-wm>	 RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:34:39] <icinga-wm>	 PROBLEM - puppet last run on cloudvirtan1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:49:01] <gilles>	 !log T216594 Start purge of namespace 0 on ruwiki
[04:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:05] <stashbot>	 T216594: Layout Stability API origin trial - https://phabricator.wikimedia.org/T216594
[04:58:28] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476
[05:01:01] <icinga-wm>	 RECOVERY - puppet last run on cloudvirtan1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:07:26] <wikibugs>	 (PS1) Dduvall: Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477
[05:07:28] <wikibugs>	 (PS1) Dduvall: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478
[05:08:23] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1075 to master [puppet] - https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115)
[05:08:46] <wikibugs>	 (CR) Dduvall: [C: +2] Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477 (owner: Dduvall)
[05:08:55] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for Thursday 10th April" [puppet] - https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:09:22] <wikibugs>	 (CR) Dduvall: [C: +2] Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478 (owner: Dduvall)
[05:09:53] <wikibugs>	 (Merged) jenkins-bot: Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477 (owner: Dduvall)
[05:10:19] <wikibugs>	 (Merged) jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478 (owner: Dduvall)
[05:12:07] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476 (owner: Marostegui)
[05:12:56] <wikibugs>	 (CR) jenkins-bot: Revert "all wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501477 (owner: Dduvall)
[05:12:58] <wikibugs>	 (CR) jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501478 (owner: Dduvall)
[05:13:12] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476 (owner: Marostegui)
[05:13:25] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501476 (owner: Marostegui)
[05:14:36] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1075 (duration: 00m 59s)
[05:14:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:19] <marostegui>	 !log Fully upgrade and reboot db1075
[05:15:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:47] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115)
[05:20:01] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for Thursday 10th April" [mediawiki-config] - https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:22:42] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115)
[05:23:06] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for Thursday 10th April" [mediawiki-config] - https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:27:21] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482
[05:29:51] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:30:11] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482 (owner: Marostegui)
[05:31:17] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482 (owner: Marostegui)
[05:32:29] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1075 with low weight (duration: 00m 58s)
[05:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:13] <wikibugs>	 (PS1) Marostegui: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115)
[05:33:51] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for Thursday 10th April" [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:35:31] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501482 (owner: Marostegui)
[05:39:05] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) @jcrespo would you mind taking a look at the above patches ^ I have also updated our etherpad with the plan  Thanks!
[05:43:43] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485
[05:47:15] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485 (owner: Marostegui)
[05:48:20] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485 (owner: Marostegui)
[05:56:57] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[05:57:55] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: More traffic to db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501485 (owner: Marostegui)
[05:59:09] <wikibugs>	 Operations, Cassandra, serviceops, Core Platform Team (Session Management Service (CDP2)), and 3 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (Joe)
[06:04:40] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1075 (duration: 01m 00s)
[06:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:38] <wikibugs>	 Operations, ops-eqiad, DBA, Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui)
[06:21:53] <wikibugs>	 Operations, cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (jijiki)
[06:22:02] <wikibugs>	 Operations, cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (jijiki) p:Triage→Normal
[06:24:10] <wikibugs>	 (CR) Vgutierrez: [C: +2] Revert "profile::cache::ssl::wikibase: Simplify" [puppet] - https://gerrit.wikimedia.org/r/501346 (owner: Alex Monk)
[06:24:23] <wikibugs>	 (PS2) Vgutierrez: Revert "profile::cache::ssl::wikibase: Simplify" [puppet] - https://gerrit.wikimedia.org/r/501346 (owner: Alex Monk)
[06:28:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:33] <wikibugs>	 (CR) ArielGlenn: "We don't currently have any testbed hosts but maybe you want to add it to that role too? hieradata/role/common/dumps/generation/worker/tes" [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff)
[06:36:21] <wikibugs>	 (CR) Vgutierrez: [C: +2] "NOOP in production: https://puppet-compiler.wmflabs.org/compiler1002/15599/" [puppet] - https://gerrit.wikimedia.org/r/501357 (owner: Vgutierrez)
[06:36:30] <wikibugs>	 (PS2) Vgutierrez: ssl::wikibase: Fix le_subjects hieradata key name [puppet] - https://gerrit.wikimedia.org/r/501357
[06:40:39] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487
[06:41:14] <wikibugs>	 Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562 (Dzahn)
[06:41:20] <wikibugs>	 Operations, DBA, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Dzahn) Resolved→Open let's only resolve stuff that is actually resolved, not what will be resolved i...
[06:51:48] <wikibugs>	 (CR) Muehlenhoff: "Sure thing, updating the patch." [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff)
[06:52:21] <wikibugs>	 (PS2) Muehlenhoff: snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370
[06:52:35] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487 (owner: Marostegui)
[06:54:14] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487 (owner: Marostegui)
[06:54:22] <wikibugs>	 Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) tensorflow-rocm 1.13.1 available for Python 3.7 on PyPi! https...
[06:55:53] <wikibugs>	 (CR) ArielGlenn: [C: +1] snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff)
[06:56:53] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1075 (duration: 00m 57s)
[06:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:19] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:03:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:50] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501487 (owner: Marostegui)
[07:08:23] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (Gilles)
[07:08:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:13:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:20] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (jcrespo) Please advise as the analytics permission masters.
[07:16:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:31] <wikibugs>	 Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (MoritzMuehlenhoff) >>! In T219764#5087063, @Krenair wrote: > Thanks. Do we know how many production hosts are affected, if any?  Affected in the sense that they are currently in a broken sta...
[07:16:39] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:15] <moritzm>	 !log upgrading mw1262-mw1265 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff2 1.8.1 (T203069)
[07:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:19] <stashbot>	 T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069
[07:20:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:24] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,11 instance=db2044:9100 job=node site=codfw Jcrespo https://phabricator.wikimedia.org/T220102 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops
[07:23:53] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:26:27] <icinga-wm>	 RECOVERY - MariaDB disk space on dbstore1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:26:53] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[07:27:41] <icinga-wm>	 PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:27:55] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:27:57] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Fully repool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/501493
[07:29:05] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:33] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:16] <jijiki>	 !log Repooling thumbor1004 until we replace its memory - T215411
[07:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:20] <stashbot>	 T215411: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411
[07:37:49] <icinga-wm>	 RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 866 bytes in 0.086 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:38:05] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26976 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:38:57] <icinga-wm>	 PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page]
[07:38:57] <icinga-wm>	 PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:38:59] <icinga-wm>	 PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[07:40:15] <icinga-wm>	 PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[07:41:01] <icinga-wm>	 PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 4 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[07:41:43] <icinga-wm>	 PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:41:43] <icinga-wm>	 PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[07:41:53] <icinga-wm>	 PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:42:01] <icinga-wm>	 PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[07:42:03] <icinga-wm>	 PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars]
[07:42:51] <icinga-wm>	 PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[07:42:51] <icinga-wm>	 PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[07:43:55] <icinga-wm>	 PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[07:44:05] <icinga-wm>	 PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[07:44:15] <icinga-wm>	 RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:44:15] <icinga-wm>	 PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[07:45:47] <wikibugs>	 (03PS3) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[07:47:14] <wikibugs>	 (03PS4) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[07:51:09] <elukey>	 !log restart gerrit on cobalt (timeouts and general slowdown)
[07:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:06] <wikibugs>	 (03PS1) 10Jcrespo: mariadb-snapshots: Only create x1 snapshots on dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/501506 (https://phabricator.wikimedia.org/T206203)
[07:55:17] <wikibugs>	 (03CR) 10Marostegui: "> Looks good (with the labsdb1004/1005 caveat)" [puppet] - 10https://gerrit.wikimedia.org/r/461035 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo)
[08:00:03] <icinga-wm>	 RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:00:25] <wikibugs>	 (03PS2) 10Jcrespo: mariadb-snapshots: Only create x1 snapshots on dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/501506 (https://phabricator.wikimedia.org/T206203)
[08:01:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/501506 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[08:02:49] <icinga-wm>	 RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:02:55] <icinga-wm>	 RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:03:05] <icinga-wm>	 RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:03:47] <wikibugs>	 (03CR) 10Marostegui: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501493 (owner: 10Marostegui)
[08:03:55] <icinga-wm>	 RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:03:55] <icinga-wm>	 RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:04:46] <wikibugs>	 (03PS2) 10Gehel: icinga: Ok when total shards is zero [puppet] - 10https://gerrit.wikimedia.org/r/501462 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[08:04:56] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501493 (owner: 10Marostegui)
[08:05:01] <icinga-wm>	 RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:05:11] <icinga-wm>	 RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:05:23] <icinga-wm>	 RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:05:23] <icinga-wm>	 RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:05:41] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] icinga: Ok when total shards is zero [puppet] - 10https://gerrit.wikimedia.org/r/501462 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[08:05:55] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501493 (owner: 10Marostegui)
[08:06:37] <icinga-wm>	 RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:06:47] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[08:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:05] <wikibugs>	 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 3 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Dzahn) a:03Dzahn
[08:07:21] <icinga-wm>	 RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:07:42] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1075 (duration: 00m 59s)
[08:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:05] <icinga-wm>	 RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:08:21] <icinga-wm>	 RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:11:17] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501493 (owner: 10Marostegui)
[08:16:04] <wikibugs>	 (03PS2) 10Elukey: Update AQS druid datasource to 2019-03 snapshot [puppet] - 10https://gerrit.wikimedia.org/r/501341 (owner: 10Joal)
[08:17:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: update swift dashboard to use new metric names [puppet] - 10https://gerrit.wikimedia.org/r/501399 (https://phabricator.wikimedia.org/T219825) (owner: 10Cwhite)
[08:17:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Update AQS druid datasource to 2019-03 snapshot [puppet] - 10https://gerrit.wikimedia.org/r/501341 (owner: 10Joal)
[08:22:07] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9643 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 592 threshold =0.15 breach: number_of_nodes: 15, number_of_in_flight_fetch: 0, status: yellow, timed_out: False, number_of_pending_tasks: 0, active_shards: 2669, active_shards_percent_as_number: 81.84605949095369, unassigned_shards: 592, relocating_shards: 0, initializing_shards: 0, task_max_wait
[08:22:07] <icinga-wm>	 is: 0, active_primary_shards: 1087, delayed_unassigned_shards: 0, cluster_name: production-search-psi-eqiad, number_of_data_nodes: 15 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:22:46] <gehel>	 ^ restart in progress, should recover in a few seconds
[08:23:22] <onimisionipe>	 15 nodes confused me a bit before I saw psi... :)
[08:23:22] <elukey>	 super
[08:23:27] <gehel>	 no actual issue, the threshold of this check is a bit high for our newer smaller clusters
[08:24:43] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9643 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, initializing_shards: 4, status: yellow, number_of_in_flight_fetch: 0, number_of_nodes: 17, relocating_shards: 0, task_max_waiting_in_queue_millis: 40696, unassigned_shards: 19, number_of_pending_tasks: 5, active_primary_shards: 1087, number_of_d
[08:24:43] <icinga-wm>	 med_out: False, delayed_unassigned_shards: 0, active_shards: 3238, active_shards_percent_as_number: 99.29469487887151 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:24:44] <wikibugs>	 (03PS4) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[08:25:14] <wikibugs>	 (03PS3) 10Muehlenhoff: toolforge: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362)
[08:25:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[08:26:43] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510
[08:27:28] <wikibugs>	 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10fgiunchedi)
[08:28:26] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510 (owner: 10Gehel)
[08:28:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] toolforge: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362) (owner: 10Muehlenhoff)
[08:29:01] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510 (owner: 10Gehel)
[08:29:09] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: raise the alerting threshold on unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/501510
[08:29:41] <wikibugs>	 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 3 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Dzahn) @Eevans Done. I added:  $application_username = 'sessions' $application_password =...
[08:30:39] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero)
[08:30:58] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) p:05Triage→03High a:03aborrero
[08:31:26] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero)
[08:31:33] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.515e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[08:31:47] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 634 threshold =0.15 breach: cluster_name: production-search-omega-eqiad, active_shards_percent_as_number: 80.90361445783132, initializing_shards: 0, number_of_data_nodes: 15, delayed_unassigned_shards: 0, active_shards: 2686, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, timed_out: F
[08:31:47] <icinga-wm>	 ary_shards: 1107, number_of_nodes: 15, status: yellow, number_of_pending_tasks: 0, unassigned_shards: 634, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:32:04] <elukey>	 Mirror Maker is surely due to the restarts --^
[08:32:19] <elukey>	 !log roll restart of aqs on aqs100* to pick up new druid settings
[08:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:41] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: labtestservices2002: rename to cloudservices2002-dev and put into service [puppet] - 10https://gerrit.wikimedia.org/r/501314 (https://phabricator.wikimedia.org/T220101)
[08:33:35] <wikibugs>	 (03CR) 10Dzahn: "there is no more redirects.conf that gets created from redirects.dat and i need to upload, right?" [puppet] - 10https://gerrit.wikimedia.org/r/501202 (https://phabricator.wikimedia.org/T219856) (owner: 10Dzahn)
[08:34:04] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Remove tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/500409 (owner: 10Muehlenhoff)
[08:34:21] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: number_of_data_nodes: 18, status: yellow, number_of_pending_tasks: 11, active_shards: 2917, task_max_waiting_in_queue_millis: 4653, cluster_name: production-search-omega-eqiad, active_primary_shards: 1107, timed_out: False, initializing_shards: 6, unassigned_shards: 397, number_of_node
[08:34:21] <icinga-wm>	 in_flight_fetch: 0, relocating_shards: 0, active_shards_percent_as_number: 87.86144578313252, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:35:51] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: labtestservices2002: rename to cloudservices2002-dev and put into service [puppet] - 10https://gerrit.wikimedia.org/r/501314 (https://phabricator.wikimedia.org/T220101)
[08:36:16] <wikibugs>	 (03PS2) 10Dzahn: druid: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/501337
[08:36:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:30] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero)
[08:36:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:37:29] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:37:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[08:37:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I think this is good to merge now. No more trusty VMs are present in Cloud VPS." [puppet] - 10https://gerrit.wikimedia.org/r/499933 (owner: 10Muehlenhoff)
[08:37:56] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestservices2002: rename to cloudservices2002-dev and put into service [puppet] - 10https://gerrit.wikimedia.org/r/501314 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[08:38:11] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:38:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:38:21] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:38:21] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:38:23] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:38:33] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:38:45] <elukey>	 checking
[08:39:05] <wikibugs>	 (03PS3) 10Dzahn: druid: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/501337
[08:39:11] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero)
[08:39:19] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:39:31] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:39:31] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:39:33] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:39:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:39:43] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:40:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:40:46] <elukey>	 so this was us switching a druid datasource, that was not cached 
[08:40:56] <elukey>	 causing timeouts until the cache was warmed up
[08:41:19] <icinga-wm>	 RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80695 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:42:03] <mutante>	 elukey: ok to add https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid   to all the druid checks?
[08:42:21] <elukey>	 yep!
[08:42:26] <mutante>	 (it was mere coincidence that i did that at the same time)
[08:42:29] <mutante>	 ok, doing :)
[08:43:18] <akosiaris>	 !log upgrade kubernetes staging cluster to 1.11.9
[08:43:20] <wikibugs>	 (03PS2) 10Muehlenhoff: role::labs::instance: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/499933
[08:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] druid: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/501337 (owner: 10Dzahn)
[08:44:08] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) p:05Triage→03High
[08:45:57] <wikibugs>	 (03PS3) 10Muehlenhoff: role::labs::instance: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/499933
[08:48:41] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:49:13] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:49:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] role::labs::instance: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/499933 (owner: 10Muehlenhoff)
[08:49:57] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: labtestservices2002: rename to cloudservices2002-dev [dns] - 10https://gerrit.wikimedia.org/r/501511 (https://phabricator.wikimedia.org/T220101)
[08:51:03] <wikibugs>	 (03CR) 10Ema: "I guess we should 's/swift-rw/swift-ro/' (sic) in ./hieradata/role/common/trafficserver/backend.yaml to make the ATS backends serve swift " [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[08:52:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestservices2002: rename to cloudservices2002-dev [dns] - 10https://gerrit.wikimedia.org/r/501511 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[08:54:24] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero)
[08:54:53] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "recheck" [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[08:55:50] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `...
[08:55:57] <arturo>	 !log T220101 reimaging+renaming labtestservices2002 to cloudservices2002-dev
[08:56:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:01] <stashbot>	 T220101: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101
[08:57:15] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=zotero
[08:57:16] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=blubberoid
[08:57:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:17] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=mathoid
[08:57:18] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=cxserver
[08:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:19] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=citoid
[08:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:40] <akosiaris>	 !log depool codfw kubernetes apps from discovery in preparation for upgrade
[08:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:01] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: nrpe: Add more rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/501512
[08:59:44] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "recheck" [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[09:00:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[09:00:55] <icinga-wm>	 PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100%
[09:02:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] nrpe: Add more rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/501512 (owner: 10Alexandros Kosiaris)
[09:02:39] <wikibugs>	 (03PS8) 10Alexandros Kosiaris: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[09:03:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[09:04:31] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "Sure. Let's say Tuesday 09 Apr" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[09:07:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "/me fixing the tests and then merging" [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[09:09:57] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "recheck" [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[09:10:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[09:10:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis)
[09:11:03] <icinga-wm>	 PROBLEM - DPKG on acrux is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:14:15] <icinga-wm>	 PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100%
[09:14:18] <icinga-wm>	 ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 6.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[09:14:53] <wikibugs>	 10Operations, 10cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (10aborrero)
[09:15:25] <icinga-wm>	 ACKNOWLEDGEMENT - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn WIP by arturo
[09:15:53] <icinga-wm>	 ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn WIP by arturo https://wikitech.wikimedia.org/wiki/DNS
[09:16:01] <wikibugs>	 (03PS9) 10Alexandros Kosiaris: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[09:16:21] <wikibugs>	 10Operations, 10cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (10aborrero) This server is part of an openstack deployment which is being bootstrapped in codfw so we can rescue/save databases as part of Trusty HW deprecation process. See parent tasks {T219...
[09:16:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[09:21:38] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh FQDN of DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/501517 (https://phabricator.wikimedia.org/T220101)
[09:23:31] <icinga-wm>	 RECOVERY - DPKG on acrux is OK: All packages OK
[09:23:39] <mutante>	 @seen andre_
[09:23:39] <wm-bot>	 mutante: Last time I saw andre_ they were changing the nickname to Guest17533, but Guest17533 is no longer in channel #wikimedia-dev at 10/18/2018 4:53:24 AM (169d4h30m15s ago)
[09:24:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "> I guess we should 's/swift-rw/swift-ro/' (sic) in ./hieradata/role/common/trafficserver/backend.yaml to make the ATS backends serve swif" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[09:24:55] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh FQDN of DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/501517 (https://phabricator.wikimedia.org/T220101)
[09:25:44] <wikibugs>	 (03CR) 10Ema: "> Unless I am mistaken, it would be better to just set active_active:" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[09:26:16] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[09:26:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:20] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[09:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825)
[09:27:21] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: refresh DNS related FQDNs [dns] - 10https://gerrit.wikimedia.org/r/501520 (https://phabricator.wikimedia.org/T220101)
[09:27:33] <logmsgbot>	 !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97)
[09:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:55] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[09:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: 10Filippo Giunchedi)
[09:28:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: refresh DNS related FQDNs [dns] - 10https://gerrit.wikimedia.org/r/501520 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[09:28:53] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:29:18] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Varnish: serve Swift traffic in active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[09:29:27] <wikibugs>	 (03PS2) 10Filippo Giunchedi: grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825)
[09:29:38] <mutante>	 ^ scandium - keeps happening ... is the test host. subbu knows
[09:30:15] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: refresh FQDN of DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/501517 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[09:30:28] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T219933
[09:31:18] <wikibugs>	 (03PS1) 10Dzahn: add fake passwords for cassandra session store [labs/private] - 10https://gerrit.wikimedia.org/r/501521 (https://phabricator.wikimedia.org/T219560)
[09:32:23] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudservices2002-dev.wikimedia.org'] `  and were...
[09:33:12] <wikibugs>	 10Operations, 10Analytics, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10fgiunchedi) @ayounsi this is good to go, unless there's signoff needed?
[09:36:24] <wikibugs>	 (03PS2) 10Dzahn: add fake passwords for cassandra session store [labs/private] - 10https://gerrit.wikimedia.org/r/501521 (https://phabricator.wikimedia.org/T219560)
[09:38:19] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices2002-dev: typo in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/501524 (https://phabricator.wikimedia.org/T220101)
[09:39:24] <wikibugs>	 (03PS5) 10Gehel: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe)
[09:39:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2002-dev: typo in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/501524 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[09:40:01] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake passwords for cassandra session store [labs/private] - 10https://gerrit.wikimedia.org/r/501521 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn)
[09:40:40] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe)
[09:40:50] <wikibugs>	 (03PS6) 10Gehel: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe)
[09:47:50] <wikibugs>	 10Operations, 10Puppet, 10puppet-compiler, 10Release-Engineering-Team (Watching / External): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) > In addition, Jenkins doesn't seem to like having more than Change-id and Bug in the footer:  Seems to...
[09:47:58] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Add restbase2019 / restbase2020 instances [dns] - 10https://gerrit.wikimedia.org/r/501525 (https://phabricator.wikimedia.org/T217368)
[09:48:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] admin: add gpu-users group and assign it to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[09:49:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501370 (owner: 10Muehlenhoff)
[09:50:11] <icinga-wm>	 PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:52:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: restbase: add restbase2019 / restbase2020 [puppet] - 10https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368)
[09:53:37] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] profile::cache::ssl::unified: Allow passing certs/certs_active by hiera (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk)
[09:53:37] <wikibugs>	 (03PS1) 10Dzahn: exim: remove wikivoyage.de from wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/501527 (https://phabricator.wikimedia.org/T219867)
[09:53:44] <mutante>	 oooh.. the puppet error on icinga1001 is real
[09:53:46] <mutante>	 for once
[09:54:58] <wikibugs>	 (03PS3) 10DCausse: Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[09:55:00] <wikibugs>	 (03PS1) 10DCausse: Add workaround to surefire [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528
[09:55:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add workaround to surefire [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528 (owner: 10DCausse)
[09:55:41] <wikibugs>	 (03Abandoned) 10Jbond: pdebuild: add a new repo for build dependencies [puppet] - 10https://gerrit.wikimedia.org/r/500464 (owner: 10Jbond)
[09:55:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[09:56:05] <mutante>	 gehel: it broke puppet on icinga due to double quotes
[09:56:13] <gehel>	 mutante: yep, I'm on it
[09:56:22] <mutante>	 'k, cool
[09:56:53] <wikibugs>	 (03PS2) 10DCausse: Add workaround to surefire [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528
[09:56:55] <wikibugs>	 (03PS4) 10DCausse: Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[09:57:01] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[09:57:47] <wikibugs>	 (03PS1) 10Ema: ATS: install libhwloc5 from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967)
[09:57:50] <wikibugs>	 (03PS3) 10Jbond: jessie-backports: remove redundant pins [puppet] - 10https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333)
[09:58:46] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: fix single quotes in graphite check [puppet] - 10https://gerrit.wikimedia.org/r/501530
[09:59:17] <gehel>	 mutante: ^ have a minute for a review?
[09:59:27] <wikibugs>	 (03PS2) 10Ema: ATS: install libhwloc5 from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967)
[09:59:36] <wikibugs>	 (03PS1) 10Dzahn: toollabs::bastion: remove trusty cgred code [puppet] - 10https://gerrit.wikimedia.org/r/501531
[10:01:44] <wikibugs>	 (03PS3) 10DCausse: Use latest parent pom [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528
[10:01:46] <wikibugs>	 (03PS5) 10DCausse: Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[10:01:48] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero)
[10:01:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "yep, this seems right. matches with what puppet says on icinga1001 " consider using double quotes"" [puppet] - 10https://gerrit.wikimedia.org/r/501530 (owner: 10Gehel)
[10:02:03] <mutante>	 sure gehel, looks right
[10:02:14] <gehel>	 mutante: thanks!
[10:02:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[10:02:30] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: fix single quotes in graphite check [puppet] - 10https://gerrit.wikimedia.org/r/501530 (owner: 10Gehel)
[10:04:11] <wikibugs>	 (03PS3) 10Ema: ATS: install libhwloc5 from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967)
[10:04:33] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: expect a long running job on stretch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string ok not found on http://checker.tools.wmflabs.org:80/grid/continuous/stretch - 158 bytes in 0.219 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? https://wikitech.wikimedia.org/wiki/portal:toolforge/admin/toolschecker
[10:04:34] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: gridengine webservice running on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 351 bytes in 0.344 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[10:04:35] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: kubernetes webservice running on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 350 bytes in 0.651 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[10:04:35] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: start a job and verify on stretch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string ok not found on http://checker.tools.wmflabs.org:80/grid/start/stretch - 158 bytes in 0.793 second response time daniel_zahn https://phabricator.wikimedia.org/T213413 ?? https://wikitech.wikimedia.org/wiki/portal:toolforge/admin/toolschecker
[10:05:20] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: install libhwloc5 from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/501529 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[10:05:24] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] Use latest parent pom [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501528 (owner: 10DCausse)
[10:05:30] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[10:05:43] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T220173 ?
[10:05:43] <icinga-wm>	 ACKNOWLEDGEMENT - keystone admin endpoint port 35357 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 35357: Connection refused daniel_zahn https://phabricator.wikimedia.org/T220173 ? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:05:43] <icinga-wm>	 ACKNOWLEDGEMENT - keystone public endoint port 5000 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 5000: Connection refused daniel_zahn https://phabricator.wikimedia.org/T220173 ? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:06:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[10:07:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] jessie-backports: remove redundant pins [puppet] - 10https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond)
[10:07:45] <wikibugs>	 (03PS4) 10Jbond: jessie-backports: remove redundant pins [puppet] - 10https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333)
[10:07:48] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - 10https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101)
[10:07:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] jessie-backports: remove redundant pins [puppet] - 10https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond)
[10:07:56] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10Dzahn) labtestnet2003 is still in Icinga:  https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=lab...
[10:08:21] <icinga-wm>	 ACKNOWLEDGEMENT - NTP on labtestnet2003 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn https://phabricator.wikimedia.org/T219776 https://wikitech.wikimedia.org/wiki/NTP
[10:08:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pbuilder: add security updates repository [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond)
[10:09:03] <wikibugs>	 (03PS5) 10Jbond: pbuilder: add security updates repository [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003)
[10:09:24] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - 10https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[10:09:36] <wikibugs>	 (03CR) 10DCausse: "recheck" [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[10:09:42] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - 10https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101)
[10:10:03] <logmsgbot>	 !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97)
[10:10:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:07] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[10:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:16] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] Fix pom to use HttpModule rather then Module [software/gerrit/plugins/barricade] - 10https://gerrit.wikimedia.org/r/501362 (owner: 10Paladox)
[10:10:44] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: pdns_recursor: in stretch, don't use the package from jessie [puppet] - 10https://gerrit.wikimedia.org/r/501532 (https://phabricator.wikimedia.org/T220101)
[10:11:00] <icinga-wm>	 RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:12:21] <icinga-wm>	 ACKNOWLEDGEMENT - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: k8s-etcd,prometheus class instances not spread out enough daniel_zahn https://phabricator.wikimedia.org/T220189 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:13:46] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "This patch is OK, but I believe you can drop the toollabs::bastion class entirely." [puppet] - 10https://gerrit.wikimedia.org/r/501531 (owner: 10Dzahn)
[10:14:51] <icinga-wm>	 ACKNOWLEDGEMENT - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] Gehel initial deployment of the check https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[10:14:51] <icinga-wm>	 ACKNOWLEDGEMENT - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] Gehel initial deployment of the check https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[10:15:19] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) Checked a bit into puppet and the analytics_deploy key is usable by analytics-admins:  `   analytics_deploy:     trusted_groups:       - analytics-admins `  Thi...
[10:15:35] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey)
[10:15:46] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.103e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[10:16:31] <wikibugs>	 (03CR) 10Dzahn: "don't forget private repo has modules/privateexim/manifests/init.pp:		"/etc/exim4/aliases/wikivoyage.de":" [puppet] - 10https://gerrit.wikimedia.org/r/501527 (https://phabricator.wikimedia.org/T219867) (owner: 10Dzahn)
[10:17:15] <wikibugs>	 (03CR) 10Elukey: "I am wondering now if the the gpu-users group could be directly a replacement of 'gpu-testers', since probably Erik doesn't need anymore t" [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[10:18:10] <wikibugs>	 (03PS1) 10Jbond: package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003)
[10:18:12] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS
[10:22:19] <wikibugs>	 (03PS2) 10Jbond: package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003)
[10:25:30] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns: give mariadb stretch support [puppet] - 10https://gerrit.wikimedia.org/r/501536 (https://phabricator.wikimedia.org/T220101)
[10:26:54] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:26:58] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:27:20] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:27:28] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:28:10] <marostegui>	 elukey: ^
[10:28:52] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices2002-dev: cleanup duplicate entries [dns] - 10https://gerrit.wikimedia.org/r/501537 (https://phabricator.wikimedia.org/T220101)
[10:29:26] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:29:26] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:29:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:30:02] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2002-dev: cleanup duplicate entries [dns] - 10https://gerrit.wikimedia.org/r/501537 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[10:30:20] <wikibugs>	 (03PS5) 10BBlack: Shortener VCL fixups [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup)
[10:30:26] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[10:30:36] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:31:02] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[10:31:08] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero)
[10:31:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:32:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC noop for eqiad1: https://puppet-compiler.wmflabs.org/compiler1002/15603/" [puppet] - 10https://gerrit.wikimedia.org/r/501536 (https://phabricator.wikimedia.org/T220101) (owner: 10Arturo Borrero Gonzalez)
[10:34:55] <joal>	 Hi ops team - letting you know I'm waiting for elukey to help with the AQS alarms above - We're on it :)
[10:35:05] <marostegui>	 thanks joal :)
[10:35:21] <wikibugs>	 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10User-Ladsgroup: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10Ladsgroup)
[10:40:40] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:40:46] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:40:48] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:40:54] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:41:19] <elukey>	 !log restart druid broker on druid1004 - exceptions in the logs after old datasource removal
[10:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:56] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero)
[10:42:34] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:42:36] <elukey>	 !log restart druid broker on druid100[5,6] - exceptions in the logs after old datasource removal
[10:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:24] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:44:00] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:44:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:45:02] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[10:45:04] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:46:16] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:46:52] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:47:56] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) a:05aborrero→03Papaul @Papaul I'm assigning this task to you to do the changes related to the...
[10:53:26] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: cloudservices2002-dev: include openldap profile [puppet] - 10https://gerrit.wikimedia.org/r/501539 (https://phabricator.wikimedia.org/T218575)
[10:54:31] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10jcrespo) analytics-deployers seems to me like a good idea, because later maybe someone else wants to do the same and we don't want those other people h...
[10:54:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: cloudservices2002-dev: include openldap profile [puppet] - 10https://gerrit.wikimedia.org/r/501539 (https://phabricator.wikimedia.org/T218575) (owner: 10Arturo Borrero Gonzalez)
[10:57:48] <arturo>	 !log updating puppet catalog compiler facts
[10:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:52] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:03:21] <wikibugs>	 (03PS1) 10Mathew.onipe: icinga: fix wrong thresholds [puppet] - 10https://gerrit.wikimedia.org/r/501542
[11:05:04] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) A bit more to the picture, managed to get facter to build by updating all references of `std::unordered_map` to `std::map`.  however I'm now getting the fol...
[11:10:04] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: cloudservices2002-dev introduce hiera keys for openldap [puppet] - 10https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575)
[11:13:41] <wikibugs>	 (03PS3) 10Effie Mouzeli: lvs: Use the kubernetes cluster for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris)
[11:14:46] <wikibugs>	 (03PS1) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203)
[11:15:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[11:16:32] <icinga-wm>	 PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:16:45] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] "Mystery solved, it was due to service restarts on scb*, we are free to proceed" [puppet] - 10https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris)
[11:21:31] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: cloudservices2002-dev introduce hiera keys for openldap [puppet] - 10https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575)
[11:24:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: cloudservices2002-dev introduce hiera keys for openldap [puppet] - 10https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575) (owner: 10Arturo Borrero Gonzalez)
[11:24:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc https://puppet-compiler.wmflabs.org/compiler1002/15607/" [puppet] - 10https://gerrit.wikimedia.org/r/501544 (https://phabricator.wikimedia.org/T218575) (owner: 10Arturo Borrero Gonzalez)
[11:25:37] <wikibugs>	 (03PS5) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[11:25:39] <wikibugs>	 (03PS2) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203)
[11:26:42] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[11:27:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[11:27:05] <logmsgbot>	 !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
[11:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:26] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:31:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:31:35] <wikibugs>	 (03Abandoned) 10Dzahn: exim: remove wikivoyage.de from wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/501527 (https://phabricator.wikimedia.org/T219867) (owner: 10Dzahn)
[11:31:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:31:53] <wikibugs>	 10Operations: DNS for wikivoyage-old.org - https://phabricator.wikimedia.org/T81727 (10Dzahn)
[11:31:59] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) 05Open→03Resolved
[11:32:26] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[11:32:46] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:32:50] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[11:33:32] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[11:33:58] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[11:34:18] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:35:30] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:37:03] <jijiki>	 !log Restarting pybal on lvs1006 and lvs2006 for 496382
[11:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:37] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on cp3041 is CRITICAL: connect to address 10.20.0.176 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220193
[11:37:40] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on cp3034 is CRITICAL: connect to address 10.20.0.169 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220194
[11:37:42] <wikibugs>	 Operations, ops-esams: Degraded RAID on cp3041 - https://phabricator.wikimedia.org/T220193 (ops-monitoring-bot)
[11:37:45] <wikibugs>	 Operations, ops-esams: Degraded RAID on cp3034 - https://phabricator.wikimedia.org/T220194 (ops-monitoring-bot)
[11:38:23] <wikibugs>	 (CR) Dzahn: "thanks! i would prefer to do incrementally" [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn)
[11:39:54] <wikibugs>	 (PS3) Muehlenhoff: snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370
[11:40:51] <wikibugs>	 (CR) Dzahn: "https://tools.wmflabs.org/openstack-browser/puppetclass/role::toollabs::bastion" [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn)
[11:40:58] <mark>	 is someone looking at the issues with esams?
[11:41:10] <wikibugs>	 (CR) Dzahn: [C: +2] toollabs::bastion: remove trusty cgred code [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn)
[11:42:06] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:42:58] <icinga-wm>	 RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[11:43:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] snapshot: Add hhvm to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/501370 (owner: Muehlenhoff)
[11:44:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:44:40] <Zppix>	 Is there a list of what sites are on which shard? I can't remember.
[11:44:48] <Zppix>	 jouncebot: now
[11:44:49] <jouncebot>	 No deployments scheduled for the next 70 hour(s) and 45 minute(s)
[11:44:57] <mutante>	 mark: it looks like the Level3 link between eqiad and esams went down
[11:45:12] <mutante>	 10Gbps wave
[11:45:28] <mark>	 yes
[11:46:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] toollabs::bastion: remove trusty cgred code [puppet] - https://gerrit.wikimedia.org/r/501531 (owner: Dzahn)
[11:48:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:49:08] <wikibugs>	 (PS1) BBlack: Depool esams, primary transport is down and seeing packet loss [dns] - https://gerrit.wikimedia.org/r/501549
[11:49:29] <Lucas_WMDE>	 Zppix: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist and other files, I guess
[11:50:02] <Zppix>	 Lucas_WMDE: ah that was it and actually that was the exact shard i was wanting to look at too :P
[11:50:13] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:50:13] <wikibugs>	 (CR) BBlack: [C: +2] Depool esams, primary transport is down and seeing packet loss [dns] - https://gerrit.wikimedia.org/r/501549 (owner: BBlack)
[11:50:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:50:25] <wikibugs>	 (PS2) BBlack: Depool esams, primary transport is down and seeing packet loss [dns] - https://gerrit.wikimedia.org/r/501549
[11:53:02] <bblack>	 !log esams depooled in DNS
[11:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:15] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:54:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:54:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:56:26] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] postgresql: set recovery.conf to writeable by postgres user [puppet] - https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652) (owner: Bstorm)
[11:56:29] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:57:52] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: acme_chief: generate certs for cloudservices2002-dev.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575)
[11:59:23] <wikibugs>	 (CR) Vgutierrez: [C: -1] "looks good in general, fix the nitpicks :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez)
[11:59:43] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 59.36 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:01:47] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[12:01:53] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: acme_chief: generate certs for cloudservices2002-dev.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575)
[12:02:41] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[12:02:51] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[12:03:25] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez)
[12:03:55] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] acme_chief: generate certs for cloudservices2002-dev.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/501550 (https://phabricator.wikimedia.org/T218575) (owner: Arturo Borrero Gonzalez)
[12:04:51] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[12:04:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:03] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:08:25] <icinga-wm>	 ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn WIP by Arturo https://wikitech.wikimedia.org/wiki/DNS
[12:08:55] <wikibugs>	 (PS1) BBlack: Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - https://gerrit.wikimedia.org/r/501552
[12:09:53] <icinga-wm>	 ACKNOWLEDGEMENT - Labs LDAP on cloudservices2002-dev is CRITICAL: Could not search/find objectclasses in dc=wikimedia,dc=org daniel_zahn host in downtime, just services on it were not https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[12:09:54] <icinga-wm>	 ACKNOWLEDGEMENT - mysqld processes on cloudservices2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld daniel_zahn host in downtime, just services on it were not https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:10:47] <jynus>	 arturo: does cloudservices2002-dev mysql really need to page?
[12:11:03] <arturo>	 not at all
[12:11:12] <wikibugs>	 (CR) BBlack: [C: +2] Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - https://gerrit.wikimedia.org/r/501552 (owner: BBlack)
[12:11:28] <arturo>	 but it didn't page, right?
[12:11:34] <mutante>	 the issue was that the host was in downtime but the services on it were not
[12:11:38] <jynus>	 double check puppet class :-)
[12:11:43] <wikibugs>	 (PS2) BBlack: Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - https://gerrit.wikimedia.org/r/501552
[12:11:45] <wikibugs>	 (CR) BBlack: [V: +2 C: +2] Revert "Depool esams, primary transport is down and seeing packet loss" [dns] - https://gerrit.wikimedia.org/r/501552 (owner: BBlack)
[12:12:03] <bblack>	 !log repool esams
[12:12:03] <mutante>	 there is an action in the icinga web ui like "this host and all services on it"
[12:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:08] <jynus>	 I think you may be using production configuration
[12:12:16] <jynus>	 on a non-production host
[12:12:44] <mutante>	 yea, probably "if -dev in host name then not CRIT" ?
[12:13:03] <mutante>	 could turn off the paging that way in hiera
[12:14:12] <mutante>	 icinga unhandled issues back to just 2 again now.. one of which is esams and one is known/common
[12:15:09] <wikibugs>	 (Abandoned) Hashar: (WIP) run tests against multiple mw versions (WIP) [mediawiki-config] - https://gerrit.wikimedia.org/r/332777 (https://phabricator.wikimedia.org/T115713) (owner: Hashar)
[12:15:31] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[12:15:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:35] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[12:15:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:11] <wikibugs>	 (Abandoned) Hashar: Fix .gitreview to point to proper repo [debs/php-excimer] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/481613 (owner: Hashar)
[12:18:17] <wikibugs>	 (Abandoned) Hashar: gbp: use upstream branch master, not tags [debs/php-excimer] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/481615 (owner: Hashar)
[12:18:27] <logmsgbot>	 !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97)
[12:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:38] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[12:18:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:08] <wikibugs>	 (PS3) Hashar: Rake: honor rubocop AllCops/Excludes [puppet] - https://gerrit.wikimedia.org/r/484410
[12:19:22] <wikibugs>	 (PS5) Hashar: rsync: readd incoming and outgoing chmod [puppet] - https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890)
[12:19:45] <wikibugs>	 (PS5) Hashar: doc: make published files group writable [puppet] - https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890)
[12:21:48] <wikibugs>	 (PS3) Hashar: shinken: add basic spec [puppet] - https://gerrit.wikimedia.org/r/497253
[12:23:09] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 74.52 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:23:11] <wikibugs>	 (CR) Ema: [C: +1] Shortener VCL fixups [puppet] - https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: Ladsgroup)
[12:25:02] <wikibugs>	 (CR) Hashar: "I have missed:" [puppet] - https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) (owner: Hashar)
[12:25:52] <wikibugs>	 (PS2) Hashar: contint: update sury.org gpg key for apt [puppet] - https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735)
[12:26:53] <wikibugs>	 (PS1) Ema: ATS: make error template directory depend on package [puppet] - https://gerrit.wikimedia.org/r/501553 (https://phabricator.wikimedia.org/T219967)
[12:26:55] <wikibugs>	 (PS3) Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203)
[12:29:18] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=zotero
[12:29:19] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=blubberoid
[12:29:20] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=mathoid
[12:29:21] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=cxserver
[12:29:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:22] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=citoid
[12:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:25] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.401e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[12:29:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:34] <akosiaris>	 !log repool codfw for all kubernetes services
[12:29:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:51] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 54.7 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:30:08] <wikibugs>	 (CR) Ema: [C: +2] ATS: make error template directory depend on package [puppet] - https://gerrit.wikimedia.org/r/501553 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[12:30:40] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[12:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:44] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[12:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:26] <wikibugs>	 (PS3) Dzahn: contint: update sury.org gpg key for apt [puppet] - https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) (owner: Hashar)
[12:31:30] <wikibugs>	 (PS1) Jcrespo: mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5 [puppet] - https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203)
[12:31:40] <wikibugs>	 (CR) Dzahn: [C: +2] "confirmed this is the same file as https://packages.sury.org/php/apt.gpg" [puppet] - https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) (owner: Hashar)
[12:31:54] <akosiaris>	 !log repool codfw for all kubernetes services T217426
[12:31:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:09] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=zotero
[12:32:10] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid
[12:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:11] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid
[12:32:12] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver
[12:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:13] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid
[12:32:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:18] <akosiaris>	 !log depool eqiad for all kubernetes services T217426
[12:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:17] <logmsgbot>	 !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97)
[12:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:20] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[12:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:56] <wikibugs>	 (PS1) Gehel: elasticsearch: cleanup logging during shard allocation [software/spicerack] - https://gerrit.wikimedia.org/r/501556
[12:39:01] <wikibugs>	 (CR) Hashar: "The paths in .Dockerignore are all joined with the directory that contains the Dockerfile.  git does the same with .gitignore." [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/484547 (https://phabricator.wikimedia.org/T183546) (owner: Hashar)
[12:41:07] <wikibugs>	 (CR) Jcrespo: "This is not yet implemented, but let me know what you think of the idea/option. By making it an option (despite I hate too many options), " [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[12:43:03] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[12:43:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:08] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[12:43:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:52] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[12:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:56] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[12:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:22] <jijiki>	 !log Restarting pybal on lvs1016 and lvs2003 for 496382
[12:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:53] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[12:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:13] <logmsgbot>	 !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=99)
[12:50:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:45] <icinga-wm>	 PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:54:13] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 82.59 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:54:31] <wikibugs>	 Operations, Traffic, Goal, Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Vgutierrez) a:Vgutierrez
[12:58:29] <wikibugs>	 (PS4) Alex Monk: profile::cache::ssl::unified: Allow passing certs/certs_active by hiera [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927)
[13:03:16] <wikibugs>	 (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/499355 (owner: Alex Monk)
[13:03:20] <wikibugs>	 (CR) Gehel: [C: -1] icinga: fix wrong thresholds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/501542 (owner: Mathew.onipe)
[13:04:27] <wikibugs>	 (CR) Vgutierrez: [C: +1] "pcc is happy and shows a NOOP now: https://puppet-compiler.wmflabs.org/compiler1002/15608/" [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[13:04:50] <wikibugs>	 (PS1) Ema: cache: add profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967)
[13:04:52] <wikibugs>	 Operations, cloud-services-team: cloudcontrol2001-dev hourly cronspam - https://phabricator.wikimedia.org/T220173 (aborrero) Open→Resolved a:aborrero I just put `exit 0` in `/etc/cron.hourly/keystone`. Is a cleanup cron that makes no sense while the service is unpopulated.
[13:05:40] <wikibugs>	 (PS5) Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[13:06:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] cache: add profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[13:07:08] <wikibugs>	 (CR) Jcrespo: [C: +1] db-eqiad.php: Change parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[13:08:00] <wikibugs>	 (PS2) Ema: cache: add profile::cache::varnish::backend [puppet] - https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967)
[13:09:26] <wikibugs>	 (PS6) Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[13:09:33] <wikibugs>	 (PS1) Giuseppe Lavagetto: docker::baseimages: sync apt with our production settings [puppet] - https://gerrit.wikimedia.org/r/501563
[13:09:35] <wikibugs>	 (PS1) Giuseppe Lavagetto: docker::baseimages: improve build-base-images [puppet] - https://gerrit.wikimedia.org/r/501564
[13:09:37] <wikibugs>	 (PS1) Giuseppe Lavagetto: apt: remove redundant Install-Recommends [puppet] - https://gerrit.wikimedia.org/r/501565
[13:10:37] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/501563 (owner: Giuseppe Lavagetto)
[13:11:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] docker::baseimages: sync apt with our production settings [puppet] - https://gerrit.wikimedia.org/r/501563 (owner: Giuseppe Lavagetto)
[13:14:23] <wikibugs>	 (PS2) Dzahn: toollabs::bastion: remove trusty cgred code [puppet] - https://gerrit.wikimedia.org/r/501531
[13:16:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] docker::baseimages: sync apt with our production settings [puppet] - https://gerrit.wikimedia.org/r/501563 (owner: Giuseppe Lavagetto)
[13:16:15] <wikibugs>	 (PS2) Giuseppe Lavagetto: docker::baseimages: sync apt with our production settings [puppet] - https://gerrit.wikimedia.org/r/501563
[13:16:18] <wikibugs>	 Operations, Office-IT, Research, Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (bmansurov) @eliza replied. Thanks!
[13:17:18] <wikibugs>	 Operations, Office-IT, Research, Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (Dzahn) Open→Resolved a:Dzahn cool, looks like we can close it here as resolved then
[13:17:37] <wikibugs>	 Operations, ops-codfw, Cloud-VPS, DC-Ops, cloud-services-team (Kanban): labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (aborrero)
[13:17:49] <wikibugs>	 Operations, ops-codfw, Cloud-VPS, DC-Ops, cloud-services-team (Kanban): labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (aborrero) p:Triage→High
[13:19:37] <icinga-wm>	 PROBLEM - Check systemd state on cloudservices2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:20:22] <wikibugs>	 (CR) Marostegui: "Question now that I realised that retention: 1" [puppet] - https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[13:22:25] <icinga-wm>	 RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[13:23:25] <icinga-wm>	 PROBLEM - puppet last run on cloudservices2002-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[mariadb]
[13:23:46] <wikibugs>	 (PS2) Giuseppe Lavagetto: docker::baseimages: improve build-base-images [puppet] - https://gerrit.wikimedia.org/r/501564
[13:24:19] <wikibugs>	 (CR) Marostegui: "I am talking from memory,but I thought we had some sort of implementation of this on WMFReplication, why not using that one?" [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[13:25:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] docker::baseimages: improve build-base-images [puppet] - https://gerrit.wikimedia.org/r/501564 (owner: Giuseppe Lavagetto)
[13:26:55] <wikibugs>	 (CR) Eevans: "I think this is missing the rack information (ala `hieradata/hosts/restbase2019.yaml` and `hieradata/hosts/restbase2020.yaml` files)." [puppet] - https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368) (owner: Filippo Giunchedi)
[13:28:40] <wikibugs>	 (PS2) Mathew.onipe: icinga: correct direction of check [puppet] - https://gerrit.wikimedia.org/r/501542
[13:29:36] <wikibugs>	 (CR) Mathew.onipe: icinga: correct direction of check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/501542 (owner: Mathew.onipe)
[13:30:07] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:30:11] <wikibugs>	 (CR) Jcrespo: "> Question now that I realised that retention: 1" [puppet] - https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[13:31:07] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: labtestnet2002: spare server in stretch [puppet] - https://gerrit.wikimedia.org/r/501567 (https://phabricator.wikimedia.org/T220203)
[13:31:20] <wikibugs>	 (PS1) Dzahn: add Icinga notes_url to various NRPE monitor checks, pt 3 [puppet] - https://gerrit.wikimedia.org/r/501568
[13:31:31] <wikibugs>	 (CR) Marostegui: [C: +1] "> > Question now that I realised that retention: 1" [puppet] - https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[13:31:34] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T219933
[13:33:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] labtestnet2002: spare server in stretch [puppet] - https://gerrit.wikimedia.org/r/501567 (https://phabricator.wikimedia.org/T220203) (owner: Arturo Borrero Gonzalez)
[13:33:45] <wikibugs>	 Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (Dzahn) cloudservices2002-dev is alerting in Icinga:  https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cg...
[13:34:44] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on cloudservices2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T220101
[13:34:44] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on cloudservices2002-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[mariadb] daniel_zahn https://phabricator.wikimedia.org/T220101
[13:35:14] <mutante>	 cwd: something broke on frack puppet related to icinga
[13:35:18] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[13:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:22] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[13:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:25] <wikibugs>	 (03PS1) 10Ema: ATS: make config files depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501569 (https://phabricator.wikimedia.org/T219967)
[13:35:40] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[13:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:45] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[13:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:10] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10fgiunchedi) a:05fgiunchedi→03Papaul @Papaul we need to allocate these hosts in the same rows as the hosts they are replacing (restbase2007 and restbase2008) thus p...
[13:36:46] <arturo>	 !log T220101 disable active icinga checks for cloudcontrol2002-dev
[13:36:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:51] <stashbot>	 T220101: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101
[13:37:31] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 1073 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[13:37:52] <mutante>	 arturo: please dont disable active checks. can we do downtime instead ?
[13:38:32] <mutante>	 imho it should disappear from icinga when going through the rename workflow 
[13:38:50] <wikibugs>	 (03CR) 10Jcrespo: "> I am talking from memory,but I thought we had some sort of" [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[13:38:53] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) @EBernhardson question for you - while working on https://gerr...
[13:38:57] <cwd>	 mutante: Jeff is fixing
[13:39:15] <mutante>	 cwd: :)
[13:39:44] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Add restbase2019 / restbase2020 instances [dns] - 10https://gerrit.wikimedia.org/r/501525 (https://phabricator.wikimedia.org/T217368) (owner: 10Filippo Giunchedi)
[13:40:21] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` labtestnet2002.codfw.wm...
[13:41:19] <arturo>	 !log T220203 reimage labtestnet2002 as spare in stretch
[13:41:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:23] <stashbot>	 T220203: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203
[13:41:23] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: docker::baseimages: fix yaml syntax [puppet] - 10https://gerrit.wikimedia.org/r/501570
[13:41:52] <arturo>	 mutante: was trying to reduce noise as much as possible. Will revisit later... or probably next week if you don't revisit before
[13:42:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: fix yaml syntax [puppet] - 10https://gerrit.wikimedia.org/r/501570 (owner: 10Giuseppe Lavagetto)
[13:42:30] <mutante>	 arturo: ok, downtime also means no alerts
[13:43:20] <jynus>	 * unless the downtime happens after the alerting
[13:43:36] <jynus>	 for example, if it pages -> downtime -> the recovery will page
[13:43:43] <jynus>	 (even on downtime)
[13:44:11] <wikibugs>	 (03CR) 10Marostegui: "> Also WMFMariaDB is not being used by any of these scripts so we have to think if we want to use mariadb to connect (in addition to cumin" [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[13:44:56] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=zotero
[13:44:57] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=blubberoid
[13:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:58] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mathoid
[13:44:59] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=cxserver
[13:45:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:00] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=citoid
[13:45:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:06] <akosiaris>	 !log ρepool eqiad for all kubernetes services T217426
[13:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:11] <akosiaris>	 !log repool eqiad for all kubernetes services T217426
[13:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:00] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: make config files depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501569 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[13:49:08] <wikibugs>	 (03PS2) 10Ema: ATS: make config files depend on package [puppet] - 10https://gerrit.wikimedia.org/r/501569 (https://phabricator.wikimedia.org/T219967)
[13:49:13] <wikibugs>	 (03CR) 10Jcrespo: "> We might be creating some tech debt." [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[13:51:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add an update action (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto)
[13:53:00] <logmsgbot>	 !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
[13:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:24] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) Doesn't seem to be needed anymore, feel free to start mo...
[13:57:45] <wikibugs>	 (03CR) 10Dzahn: "still don't really understand why we want to manually edit files on doc and for the rsync options i will add Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar)
[13:59:17] <wikibugs>	 (03Abandoned) 10Elukey: admin: add gpu-users group and assign it to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[14:04:21] <wikibugs>	 (03PS1) 10Elukey: admin: remove sudo permissions from gpu-testers and add users to it [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843)
[14:06:08] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/15611/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[14:06:27] <wikibugs>	 (03PS3) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967)
[14:06:50] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[14:07:42] <wikibugs>	 10Operations: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (10ssastry) See logs below from scandium. Seems to reliably shut down after 1 hour of no activity. Must be some config setting in some nodejs library. ` $ sudo journalctl -n 1000 -u parsoid-vd | egrep "8011|FAIL" .......
[14:08:02] <wikibugs>	 (03PS7) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[14:09:21] <wikibugs>	 (03PS3) 10Gehel: icinga: correct direction of check [puppet] - 10https://gerrit.wikimedia.org/r/501542 (owner: 10Mathew.onipe)
[14:10:14] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] icinga: correct direction of check [puppet] - 10https://gerrit.wikimedia.org/r/501542 (owner: 10Mathew.onipe)
[14:10:41] <wikibugs>	 10Operations: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (10ssastry) @Arlolra @cscott Any insights here? Why would express terminate after an hour of being idle? (testreduce codebase for parsoid-vd service on scandium).  This is not critical. I asked @Dzahn to turn off these...
[14:14:07] <wikibugs>	 (03PS2) 10Elukey: admin: remove sudo permissions from gpu-testers and add users to it [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843)
[14:14:09] <wikibugs>	 (03PS1) 10Elukey: admin: create analytics-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175)
[14:15:38] <wikibugs>	 (03PS2) 10Elukey: admin: create analytics-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175)
[14:16:00] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['labtestnet2002.codfw.wmnet'] `  and were **ALL** successful.
[14:16:42] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) p:05Triage→03Normal a:03elukey
[14:21:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Aside from my comment, rest lgtm" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto)
[14:25:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto)
[14:26:49] <wikibugs>	 (03PS1) 10Elukey: role::statistics::gpu: add common statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843)
[14:27:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add changelog (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 (owner: 10Giuseppe Lavagetto)
[14:27:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role::statistics::gpu: add common statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[14:28:43] <elukey>	 how dare you Luca using tabs
[14:29:44] <wikibugs>	 (03PS1) 10Alex Monk: openstack::puppet::master::encapi: work on stretch with python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/501581 (https://phabricator.wikimedia.org/T171188)
[14:29:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor typo, logic LGTM" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 (owner: 10Giuseppe Lavagetto)
[14:30:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Andrew)
[14:30:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 (owner: 10Giuseppe Lavagetto)
[14:30:15] <mutante>	 did somebody just disable notifications for everything on scandium? please dont
[14:31:04] <wikibugs>	 (03PS2) 10Elukey: role::statistics::gpu: add common statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843)
[14:32:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, but some tests for --snapshot would be nice" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 (owner: 10Giuseppe Lavagetto)
[14:32:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15613/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/501580 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[14:33:16] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack::puppet::master::encapi: work on stretch with python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/501581 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk)
[14:33:23] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack::puppet::master::encapi: work on stretch with python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/501581 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk)
[14:39:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Well, the fact we populate it on installation should not mean that we avoid enforcing it in runtime." [puppet] - 10https://gerrit.wikimedia.org/r/501565 (owner: 10Giuseppe Lavagetto)
[14:40:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond)
[14:41:33] <wikibugs>	 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Eevans) @Dzahn I think this was just for !!labs/private.git!!, could you do the same for p...
[14:42:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond)
[14:42:40] <wikibugs>	 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Dzahn) @eevans I did both, the private repo part just doesn't show up on ticket.
[14:44:03] <wikibugs>	 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Eevans) >>! In T219560#5088509, @Dzahn wrote: > @eevans I did both, the private repo part...
[14:45:03] <wikibugs>	 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Dzahn) @Eevans Yep, i noticed that too and couldn't find the puppet code that would use th...
[14:46:21] <wikibugs>	 (03PS1) 10Dzahn: icinga/parsoid: disable systemd monitoring on scandium test host [puppet] - 10https://gerrit.wikimedia.org/r/501586 (https://phabricator.wikimedia.org/T219933)
[14:47:39] <wikibugs>	 (03PS1) 10Alex Monk: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587
[14:48:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] icinga/parsoid: disable systemd monitoring on scandium test host [puppet] - 10https://gerrit.wikimedia.org/r/501586 (https://phabricator.wikimedia.org/T219933) (owner: 10Dzahn)
[14:49:14] <wikibugs>	 (03PS2) 10Alex Monk: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188)
[14:49:52] <mutante>	 urandom: can you see the puppet code location that should use the new private variables ?
[14:50:24] <wikibugs>	 (03PS4) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967)
[14:51:49] <urandom>	 mutante: there are two templates, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/cassandra/templates/adduser.cql.erb and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/cassandra/templates/cqlshrc.erb
[14:53:44] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] icinga/parsoid: disable systemd monitoring on scandium test host [puppet] - 10https://gerrit.wikimedia.org/r/501586 (https://phabricator.wikimedia.org/T219933) (owner: 10Dzahn)
[14:54:21] <wikibugs>	 (03CR) 10Jcrespo: "Comment." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui)
[14:54:32] <mutante>	 urandom: yes, i saw those, but something else must be missing in .pp files
[14:54:54] <urandom>	 oh.
[14:55:00] <urandom>	 hrm.
[14:55:22] <papaul>	 !log powering down restbase2019 and 2020 for relocation
[14:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:26] <mutante>	 there is definitely class passwords::cassandra with $application_username = 'seessions'
[14:55:30] <mutante>	 sessions
[14:56:05] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: 10Filippo Giunchedi)
[14:56:50] <icinga-wm>	 PROBLEM - Host restbase2020 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:53] <mutante>	 maybe the "include ::passwords::cassandra"
[14:57:01] <mutante>	 hmm.. 
[14:57:06] <urandom>	 mutante: I'll bet it's missing from role::sessionstore
[14:57:12] <urandom>	 s/from/in/
[14:57:19] <urandom>	 i.e. it ought to be there, and isn't :/
[14:57:34] <icinga-wm>	 PROBLEM - Host restbase2019 is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:39] <mutante>	 that includes profile::cassandra though
[14:57:53] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) p:05High→03Normal Lowering priority since the remaining steps are less urgent.
[14:57:54] <mutante>	 and that includes the password class.. hrmm
[14:59:12] <urandom>	 mutante: the restbase role has a `include ::passwords::cassandra`, and the sessionstore one does not
[14:59:36] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: add myspell guards [puppet] - 10https://gerrit.wikimedia.org/r/501589 (https://phabricator.wikimedia.org/T148843)
[15:00:13] <mutante>	 urandom: heh, i guess that is it.. though profile::cassandra also has  include ::passwords::cassandra
[15:00:43] <mutante>	 that one probably not in scope then
[15:01:15] <urandom>	 I wonder what would happen if it were removed
[15:01:42] <mutante>	 we can upload a patch to do that and then compile it
[15:01:52] <urandom>	 ya
[15:02:05] <icinga-wm>	 PROBLEM - Host restbase2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:02:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15615/" [puppet] - 10https://gerrit.wikimedia.org/r/501589 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[15:04:00] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10aborrero) Right now this server is spare, waiting for further decisions, see https://phabricator.wikimedia.org/T217891#5088306
[15:04:21] <mutante>	 adds the include to sessionstore role
[15:04:33] <wikibugs>	 (03PS1) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560)
[15:04:59] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn)
[15:05:06] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] db-eqiad.php: Promote db1075 to master (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui)
[15:05:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn)
[15:05:45] <urandom>	 doh
[15:05:45] <wikibugs>	 10Operations, 10Patch-For-Review: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (10Dzahn) systemd monitoring has been removed from icinga entirely for scandium. that's what was flapping due to this. all other base monitoring checks are still here and enabled (again).
[15:05:53] <mutante>	 gah.. style check? 
[15:06:15] <mutante>	 yep... 
[15:06:30] <mutante>	 we are not supposed to include them in the role in the first place ...
[15:08:13] <icinga-wm>	 PROBLEM - Host restbase2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:17] <mutante>	 how is it even possible that we get alerts for restbase2020 before it is in DNS  https://gerrit.wikimedia.org/r/c/operations/dns/+/501525
[15:09:22] <mutante>	 sigh..it never stops
[15:10:46] <mutante>	 papaul: seems they are in the wrong row
[15:11:10] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase2019 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368
[15:11:10] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368
[15:11:10] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase2020 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368
[15:11:10] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217368
[15:12:27] <mutante>	 urandom: not sure yet why the include in profile::cassandra doesnt give us what we want already ..
[15:13:16] <urandom>	 mutante: or why it's OK for it to be included in the restbase role...
[15:13:23] <urandom>	 does the order matter?
[15:13:49] <mutante>	 urandom: it's probably not ok.. just that jenkins check only looks for NEW violations and ignores existing ones
[15:13:58] <urandom>	 OK
[15:14:13] <wikibugs>	 10Operations, 10Traffic, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki) We can gradually increase API PHP7 traffic by switching completely each server to PHP7 one or more at a time. If we assume for example that each A...
[15:14:57] <wikibugs>	 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki)
[15:15:00] <wikibugs>	 10Operations, 10serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki)
[15:15:04] <wikibugs>	 10Operations, 10Traffic, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki)
[15:15:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for gtirloni [puppet] - 10https://gerrit.wikimedia.org/r/501596 (https://phabricator.wikimedia.org/T220211)
[15:16:07] <wikibugs>	 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki)
[15:16:10] <wikibugs>	 10Operations, 10serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki)
[15:16:13] <wikibugs>	 10Operations, 10serviceops, 10Beta-Feature: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (10jijiki)
[15:16:33] <icinga-wm>	 PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:35] <wikibugs>	 (03PS8) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[15:16:43] <icinga-wm>	 PROBLEM - Host poolcounter2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:17:19] <icinga-wm>	 PROBLEM - Host install2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:17:19] <icinga-wm>	 PROBLEM - Host ganeti2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:17:21] <icinga-wm>	 PROBLEM - Host pollux is DOWN: PING CRITICAL - Packet loss = 100%
[15:17:21] <icinga-wm>	 PROBLEM - Host tureis is DOWN: PING CRITICAL - Packet loss = 100%
[15:17:35] <icinga-wm>	 PROBLEM - Host releases2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:18:17] <mutante>	 ugh
[15:18:21] <herron>	 hmmm
[15:18:55] <herron>	 we lost ganeti2004 just now?
[15:19:27] <mutante>	 urandom: for some reason it set @super_password to "cassandra" in cqlshrc on sessionstore1001 ?
[15:19:34] <mutante>	 herron: ganeti.. yea :(
[15:19:42] <chaomodus>	 ahh that explains it
[15:19:47] <herron>	 will these automatically relocate to another host?
[15:19:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "makes sense to me. Worth double-checking that we are not breaking the code live in production when this change is merged. Specially some m" [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[15:20:04] <urandom>	 mutante: that's the default
[15:20:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for gtirloni [puppet] - 10https://gerrit.wikimedia.org/r/501596 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff)
[15:21:40] <mutante>	 chaomodus: cant reach mgmt either
[15:21:47] <mutante>	 asking in dcops about ongoing work
[15:22:19] <chaomodus>	 2003 isn't reachable?
[15:22:39] <moritzm>	 2004 mgmt works for me
[15:22:44] <wikibugs>	 (03PS1) 10Elukey: profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs [puppet] - 10https://gerrit.wikimedia.org/r/501600 (https://phabricator.wikimedia.org/T148843)
[15:23:32] <herron>	 I’m on 2003 at the moment
[15:23:35] <icinga-wm>	 RECOVERY - Host restbase2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.60 ms
[15:24:08] <chaomodus>	 ganeti2004.codfw.wmnet      ?     ?      ?     ?     ?     6     7
[15:24:12] <chaomodus>	 in the node list
[15:24:46] <herron>	 https://www.irccloud.com/pastebin/dIJXyDST/
[15:24:51] <mutante>	 papaul is checking the cable
[15:24:57] <mutante>	 this is B5
[15:25:17] <herron>	 safe to attempt to restart these, or might that make things worse?
[15:25:22] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: reimage to stretch and rename to cloudnet2004-dev - https://phabricator.wikimedia.org/T220203 (10aborrero) 05Open→03Declined We are keeping labtestnet2002 as spare for now. Is running Debian already, with role::spare.
[15:25:30] <mutante>	 herron: ip link show 
[15:25:32] <chaomodus>	 herron: don't they have to be migrated before they can start again?
[15:25:33] <mutante>	 does it still have link?
[15:26:17] <herron>	 mutante: what do you mean “it” ?
[15:26:34] <wikibugs>	 (03PS2) 10Elukey: profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs [puppet] - 10https://gerrit.wikimedia.org/r/501600 (https://phabricator.wikimedia.org/T148843)
[15:26:51] <mutante>	 herron: ganeti2004 or ganeti2003
[15:27:01] <mutante>	 you mentioned both
[15:27:16] <herron>	 I can’t reach ganeti2004, but I’m logged in to ganeti2003
[15:27:17] <icinga-wm>	 RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 37.24 ms
[15:27:17] <icinga-wm>	 RECOVERY - Host install2002 is UP: PING OK - Packet loss = 0%, RTA = 37.83 ms
[15:27:19] <icinga-wm>	 RECOVERY - Host releases2001 is UP: PING OK - Packet loss = 0%, RTA = 38.27 ms
[15:27:23] <icinga-wm>	 RECOVERY - Host ganeti2004 is UP: PING OK - Packet loss = 0%, RTA = 73.08 ms
[15:27:25] <icinga-wm>	 RECOVERY - Host poolcounter2002 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms
[15:27:25] <icinga-wm>	 RECOVERY - Host pollux is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms
[15:27:25] <mutante>	 there we go
[15:27:31] <icinga-wm>	 RECOVERY - Host tureis is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms
[15:27:35] <chaomodus>	 fwew
[15:27:39] <chaomodus>	 cable issue?
[15:27:44] <mutante>	 it was the cable on 2004
[15:27:48] <mutante>	 -dcops
[15:27:55] <chaomodus>	 at least it was nothing serious :)
[15:28:01] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10RobH) Does this new group have sudo?  If they are becoming 'analytics' users, it sounds like it will?  (I don't see sudo rights i...
[15:28:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs [puppet] - 10https://gerrit.wikimedia.org/r/501600 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[15:28:56] <wikibugs>	 (03PS3) 10Bstorm: postgresql: set recovery.conf to writeable by postgres user [puppet] - 10https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652)
[15:30:13] <icinga-wm>	 PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:30:13] <icinga-wm>	 PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:30:13] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting access to analytics sudo on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) >>! In T220175#5088605, @RobH wrote: > Does this new group have sudo?  If they are becoming 'analytics' users, it sounds...
[15:30:16] <herron>	 I think the answer to my question is trying to restart them would have made it worse
[15:30:21] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] postgresql: set recovery.conf to writeable by postgres user [puppet] - 10https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm)
[15:30:35] <herron>	 since the cluster was probably split brain until the cable was reconnected?
[15:30:49] <icinga-wm>	 PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:31:17] <icinga-wm>	 PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:31:17] <herron>	 maybe it wouldn’t have allowed that action though.  anyway glad it's back to normal
[15:31:22] <chaomodus>	 herron: should probably come up with a strategy to deal with that, like the wikitech page is about when you know it's gonna go down - you drain the node and stuff
[15:32:31] <icinga-wm>	 PROBLEM - puppet last run on poolcounter2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:32:40] <chaomodus>	 ahah
[15:32:48] <cdanis>	 https://wikitech.wikimedia.org/wiki/Service_restarts#Ganeti
[15:32:57] <icinga-wm>	 PROBLEM - puppet last run on tureis is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:32:58] <cdanis>	 for planned maintenance on hosts there are docs
[15:33:00] <chaomodus>	 more evidence that this problem is due to load on puppetmaster
[15:33:28] <chaomodus>	 cdanis: yes, i mean, if there's an /unexpected/ outage of a node
[15:33:36] <chaomodus>	 cdanis: there should be a strategy
[15:33:45] <herron>	 but also the vms on ganeti2004 were without network for a bit
[15:33:50] <herron>	 re: puppet fails
[15:34:21] <mutante>	 it's going to be just that ... running puppet
[15:34:26] <chaomodus>	 right so they are all running simultaneously
[15:34:36] <chaomodus>	 and loading puppetmaster in some way that causes the catalog fetch fail
[15:34:53] <cdanis>	 chaomodus: ah, yeah
[15:34:58] <herron>	 oh I guess it could be, I was thinking more along the lines of the scheduled cron failed due to cable issue and now icinga noticed
[15:35:16] <mutante>	 nah, it's fine on poolcounter2002 but had to manually run it
[15:35:29] <chaomodus>	 i was thinking they were blocking on connect so they all were able to connect at the same time
[15:35:33] <mutante>	 i don't think they were running at the same time. it's randomized
[15:35:33] <chaomodus>	 but anyways
[15:35:35] <chaomodus>	 https://nsrc.org/workshops/2016/sanog27/raw-attachment/wiki/Track2Virt/ex-ganeti-failure-scenarios.htm 
[15:37:47] <icinga-wm>	 RECOVERY - puppet last run on poolcounter2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:38:13] <icinga-wm>	 RECOVERY - puppet last run on tureis is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:39:39] <herron>	 I think these puppet fails are a different condition from our suspected load related 503s.  the error was poolcounter2002 puppet-agent[13231]: Could not retrieve catalog from remote server: getaddrinfo: Name or service not known
[15:39:54] <chaomodus>	 interesting
[15:41:02] <chaomodus>	 good point
[15:41:07] <icinga-wm>	 RECOVERY - Host restbase2020.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 345.43 ms
[15:41:56] * Krinkle staging on mwdebug1002 to roll out three patches for wmf.24 (group0 only)
[15:45:21] <wikibugs>	 (03PS2) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560)
[15:45:38] <mutante>	 herron: i think that had nothing to do with master overload
[15:45:43] <mutante>	 it was just the network being down
[15:46:01] <icinga-wm>	 RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[15:46:12] <wikibugs>	 (03PS1) 10Elukey: ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843)
[15:46:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn)
[15:47:28] <wikibugs>	 (03PS3) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560)
[15:49:44] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Papaul) a:05Papaul→03Marostegui complete
[15:50:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211)
[15:50:12] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn)
[15:51:19] <icinga-wm>	 RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:51:45] <wikibugs>	 (03PS1) 10BBlack: wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610
[15:51:48] <wikibugs>	 (03PS1) 10BBlack: wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611
[15:51:49] <wikibugs>	 (03PS1) 10BBlack: wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612
[15:51:51] <wikibugs>	 (03PS1) 10BBlack: wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613
[15:51:52] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/GlobalBlocking/includes/specials/: I5843cd181ca7d (duration: 01m 02s)
[15:51:54] <wikibugs>	 (03PS1) 10BBlack: ns[012].wikimedia.org: 1D TTLs [dns] - 10https://gerrit.wikimedia.org/r/501614
[15:51:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:56] <wikibugs>	 (03PS1) 10BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615
[15:52:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611 (owner: 10BBlack)
[15:52:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610 (owner: 10BBlack)
[15:52:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612 (owner: 10BBlack)
[15:52:19] <icinga-wm>	 RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:52:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff)
[15:52:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 (owner: 10BBlack)
[15:52:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ns[012].wikimedia.org: 1D TTLs [dns] - 10https://gerrit.wikimedia.org/r/501614 (owner: 10BBlack)
[15:52:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613 (owner: 10BBlack)
[15:52:58] <mutante>	 urandom: "Compilation results for sessionstore1001.eqiad.wmnet: no change"  :(
[15:53:02] <mutante>	 wtf
[15:53:13] <mutante>	 merges it anyways to see :)
[15:53:29] <wikibugs>	 (03CR) 10Muehlenhoff: ores::base: fix package requires for Debian Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[15:53:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn)
[15:53:44] <wikibugs>	 (03PS1) 10Jbond: aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803)
[15:53:46] <wikibugs>	 (03PS4) 10Dzahn: sessionstore: include cassandra passwords [puppet] - 10https://gerrit.wikimedia.org/r/501591 (https://phabricator.wikimedia.org/T219560)
[15:53:48] <wikibugs>	 (03PS1) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803)
[15:53:50] <wikibugs>	 (03PS1) 10Jbond: facter3 and puppet5: add repositories for puppet5 and facter3 [puppet] - 10https://gerrit.wikimedia.org/r/501618 (https://phabricator.wikimedia.org/T219803)
[15:54:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[15:54:57] <wikibugs>	 (03CR) 10Elukey: ores::base: fix package requires for Debian Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[15:55:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[15:55:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[15:55:29] <wikibugs>	 (03PS2) 10Jbond: aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803)
[15:55:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] facter3 and puppet5: add repositories for puppet5 and facter3 [puppet] - 10https://gerrit.wikimedia.org/r/501618 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[15:56:37] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211)
[15:57:07] <icinga-wm>	 RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:57:27] <wikibugs>	 (03PS3) 10Jbond: aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803)
[15:57:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] aptrepo: create new components for facter3 and puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/501616 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[15:57:55] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/NavigationTiming/: I6b23be850d35c7d19 / T220156 (duration: 01m 00s)
[15:57:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:58] <stashbot>	 T220156: navtiming: firstPaint.mobile metric broken on wmf.24 - https://phabricator.wikimedia.org/T220156
[15:58:17] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) - restbase2019 relocate to B5 restbase2020 relocate to C5 - Netbox update - clean asw-a-codfw
[15:58:28] <mutante>	 urandom: merged and ran puppet and nothing happened. it must be something else we don't see yet :(
[15:58:31] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211)
[15:58:33] <mutante>	 not that include
[15:58:45] <urandom>	 wth
[15:58:46] <wikibugs>	 (03PS2) 10BBlack: wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610
[15:58:48] <wikibugs>	 (03PS2) 10BBlack: wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611
[15:58:49] <urandom>	 mutante: OK
[15:58:49] <wikibugs>	 (03PS2) 10BBlack: wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612
[15:58:51] <wikibugs>	 (03PS2) 10BBlack: wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613
[15:58:54] <wikibugs>	 (03PS2) 10BBlack: ns[012].wikimedia.org: 1D TTLs [dns] - 10https://gerrit.wikimedia.org/r/501614
[15:58:56] <wikibugs>	 (03PS2) 10BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615
[15:59:25] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Cmjohnson)
[15:59:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 (owner: 10BBlack)
[15:59:40] <wikibugs>	 (03PS3) 10Jbond: package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003)
[15:59:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff)
[15:59:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove gtirloni from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/501609 (https://phabricator.wikimedia.org/T220211) (owner: 10Muehlenhoff)
[16:00:37] <wikibugs>	 (03PS4) 10Jbond: package-builder: export hook variable [puppet] - 10https://gerrit.wikimedia.org/r/501535 (https://phabricator.wikimedia.org/T220003)
[16:00:56] <mutante>	 urandom: at least then we don't have to wonder why the include in the profile doesn't work.. trying to see the silver lining. heh. but also need to run soon
[16:01:25] <wikibugs>	 (03PS2) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803)
[16:01:29] <urandom>	 mutante: I understand
[16:01:42] <urandom>	 mutante: for another day then; thanks!
[16:01:49] <mutante>	 right on Monday!
[16:02:11] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/includes/jobqueue/jobs/RefreshLinksJob.php: Ib1ac31365f9c / T220037 (duration: 00m 59s)
[16:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:15] <stashbot>	 T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037
[16:02:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[16:02:30] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Cmjohnson) a:05Cmjohnson→03RobH Assigning back to robh NIC has been enabled to PXE, second cable has been run, por...
[16:02:45] <wikibugs>	 (03PS1) 10Dzahn: Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620
[16:03:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Cmjohnson)
[16:03:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Cmjohnson) a:05Cmjohnson→03Andrew This server should be ready to go
[16:04:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Cmjohnson)
[16:04:56] <wikibugs>	 (03PS3) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803)
[16:05:22] <wikibugs>	 (03PS2) 10Dzahn: Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620
[16:05:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Cmjohnson) a:05Cmjohnson→03RobH @robh connected the second cable, updated switch cfg with cloud-virt-instance-trunk. Ran the SPP
[16:05:58] <wikibugs>	 (03PS3) 10Dzahn: Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620
[16:06:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "sessionstore: include cassandra passwords" [puppet] - 10https://gerrit.wikimedia.org/r/501620 (owner: 10Dzahn)
[16:06:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10Cmjohnson)
[16:07:09] <wikibugs>	 (03PS3) 10BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615
[16:08:05] <wikibugs>	 (03PS2) 10Elukey: ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843)
[16:08:42] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10Cmjohnson) Everything is done with this server but I am not getting any link lights on the 10G card. I verified and re-verified that the car...
[16:09:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Cmjohnson)
[16:09:41] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Cmjohnson) Everything is done with this server but I am not getting any link lights on the 10G card. I verified and re-verified that the car...
[16:09:48] <wikibugs>	 (03CR) 10Elukey: "Mortiz: not sure if this version is correct, because on the same pre-existing stretch host we'd end up with, for example, both hunspell-ca" [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[16:09:50] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10Papaul) papaul@asw-c-codfw# run show interfaces ge-1/0/17 descriptions  Interface       Admin Link Descript...
[16:10:02] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10Cmjohnson)
[16:10:14] <dmaza>	 I'm trying to push a patch to gerrit and I keep getting a disconnect issue with "Too many authentication failures: 7" message.
[16:10:19] <dmaza>	 Any idea what's going on?
[16:10:20] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10Papaul)
[16:10:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10Cmjohnson) a:05Cmjohnson→03Andrew The server is moved and is ready to install
[16:10:55] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10Papaul) a:05Papaul→03aborrero
[16:12:40] <wikibugs>	 (03PS2) 10Jbond: facter3 and puppet5: add repositories for puppet5 and facter3 [puppet] - 10https://gerrit.wikimedia.org/r/501618 (https://phabricator.wikimedia.org/T219803)
[16:14:46] <wikibugs>	 10Operations, 10wikidiff2, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10Krinkle) p:05Triage→03High
[16:15:44] <cdanis>	 dmaza: do you know when it stopped working for you?
[16:16:03] <dmaza>	 it worked fine 2 days ago
[16:16:23] <dmaza>	 actually, it worked fine yesterday
[16:16:24] <cdanis>	 and you can log into the web ui fine?
[16:16:28] <dmaza>	 yes
[16:16:50] <cdanis>	 checked ssh pubkey and all that?
[16:17:18] <dmaza>	 yup.. Lemme double check it is the same but nothing has changed on my system
[16:18:00] <hashar>	 dmaza: the ssh authentication fails due to the ssh key not matching
[16:18:39] <hashar>	 found via:  ssh cobalt.wikimedia.org grep dmaza /var/log/gerrit/sshd_log
[16:18:53] <dmaza>	 let me re-add the keys.. that's very odd
[16:19:04] <hashar>	 you can check at https://gerrit.wikimedia.org/r/#/settings/ssh-keys
[16:19:05] <cdanis>	 yeah, I see it working yesterday in those logs as well
[16:19:14] <hashar>	 and also verify your local ssh config to make sure it offers the proper key
[16:19:45] <wikibugs>	 (03PS1) 10Elukey: Fix more common packages deployed to Buster based Analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/501621 (https://phabricator.wikimedia.org/T148843)
[16:20:08] <cdanis>	 if it is a 'client not offering the correct key' issue, then this might be helpful dmaza: ssh -v -p 29418 gerrit.wikimedia.org
[16:20:26] <dmaza>	 welp.. I restarted sshd and re-added my key and it works now
[16:21:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Fix more common packages deployed to Buster based Analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/501621 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[16:21:07] <dmaza>	 thank you and sorry for the inconvenience 
[16:21:33] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Marostegui) Thanks, it is rebuilding `       logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 54% complete)        physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)       physicaldrive 1I:1:2 (port...
[16:21:34] <hashar>	 dmaza: and your key has not been changed in gerrit :)
[16:21:40] <hashar>	 dmaza: well done!
[16:22:17] <hashar>	 cdanis: so for Gerrit, I usually jump to that cobalt.wikimedia.org:/var/log/gerrit/sshd_log and usually that gives some good enough clues (eg: auth failure)
[16:22:23] <dmaza>	 haha.. I assumed that someone was brute-forcing my account 🤷‍♂️ 
[16:22:27] <hashar>	 sometimes that is just using the wrong username
[16:22:42] <hashar>	 since ssh would typically use the local username which might not match the WMCS shell name
[16:22:46] <wikibugs>	 10Operations, 10wikidiff2, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10WMDE-Fisch) We changed the signature in wikidiff2 version 1.8.0 so not using $wikiDiff2MovedParagraphDetectionCuto...
[16:22:54] <cdanis>	 yeah, makes sense
[16:23:17] <dmaza>	 right. I wonder if git review has a "verbose" option
[16:24:22] <dmaza>	 oh it does (-v). Maybe I'll try that if I have any other issues. It might spit out some useful info
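A minimal sketch of the client-side checks discussed above (offer the proper key, use the right username), written as an ssh client config entry. The username and key path are hypothetical placeholders, not values from this log. One common cause of "Too many authentication failures" is an agent offering several keys before the right one; `IdentitiesOnly yes` restricts ssh to the listed key. Whether that was the cause here is not established by the conversation.

```text
# ~/.ssh/config (illustrative entry; adjust the placeholders)
Host gerrit.wikimedia.org
    Port 29418
    # Hypothetical: your Gerrit/shell username, which may differ from your local login
    User your-shell-name
    # Hypothetical path: the private key whose public half is registered in Gerrit
    IdentityFile ~/.ssh/id_ed25519
    # Offer only the key above instead of every key held by the agent
    IdentitiesOnly yes
```

With an entry like this in place, `ssh -v gerrit.wikimedia.org` should show in its debug output which key is offered and which username is used.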
[16:24:52] <wikibugs>	 (03PS4) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803)
[16:25:32] <wikibugs>	 10Operations, 10wikidiff2, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10MoritzMuehlenhoff) @Krinkle, @WMDE-Fisch : Shall we depool the five servers already upgraded until that is resolved?
[16:27:51] <wikibugs>	 (03CR) 10Jbond: "catalogue compile output" [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[16:30:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[16:30:36] <wikibugs>	 10Operations, 10wikidiff2, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10Krinkle) @WMDE-Fisch Does this mean the feature is no longer exists, or is no longer configu...
[16:30:54] <wikibugs>	 (03CR) 10CRusnov: "Just a quick look through. I'm no puppet expert but overall seems good moving toward the standard modren way things are done and not the o" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[16:30:57] <wikibugs>	 (03CR) 10Muehlenhoff: "The packages should probably audited one by one on a stretch system, e.g. myspell-ca is a transitional package in stretch and a virtual pa" [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[16:33:41] <wikibugs>	 10Operations, 10wikidiff2, 10Patch-For-Review, 10Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (10WMDE-Fisch) >>! In T220217#5088833, @Krinkle wrote: > @WMDE-Fisch Does this mean the feature...
[16:33:43] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10CDanis) >>! In T219803#5088004, @jbond wrote: > A bit more to the picture, managed to get facter to build by updating all refrence of `std::unordered_map` to `s...
[16:34:43] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) ` papaul@asw-b-codfw> show interfaces ge-5/0/18 descriptions  Interface       Admin Link Description ge-5/0/18       up    down restbase2019  papaul@asw-c-codf...
[16:36:18] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) Ran into this in {...
[16:37:26] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) @CDanis thanks i will try a patch with some of the other maps however the problem is that the std::unsorted_map is available but it has a bug in the libr...
[16:38:21] <wikibugs>	 (03PS1) 10BBlack: Add CNAME-variant langlist template [dns] - 10https://gerrit.wikimedia.org/r/501628 (https://phabricator.wikimedia.org/T208263)
[16:38:23] <wikibugs>	 (03PS1) 10BBlack: wiktionary: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/501629 (https://phabricator.wikimedia.org/T208263)
[16:39:42] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis)
[16:40:11] <wikibugs>	 (03PS1) 10Andrew Bogott: Toolforge: update indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630
[16:42:18] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10CDanis) Ah, got it.  Sorry for not reading more of the context here, just saw that one line and thought "uh oh" :)
[16:43:19] <wikibugs>	 (03CR) 10Muehlenhoff: "Maybe sync that up with mediawiki::packages::fonts, e.g. the Malayalam fonts are missing in your list." [puppet] - 10https://gerrit.wikimedia.org/r/501630 (owner: 10Andrew Bogott)
[16:43:31] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Krinkle)
[16:44:13] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Krinkle) (Adding to our radar to look at navtiming/dns metrics impact after it is rolled out.)
[16:45:25] <wikibugs>	 (03PS1) 10Elukey: Fix more common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501632 (https://phabricator.wikimedia.org/T148843)
[16:46:16] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) >>! In T219803#5088899, @CDanis wrote: > Ah, got it.  Sorry for not reading more of the context here, just saw that > one line and thought "u...
[16:47:23] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Fix more common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501632 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[16:47:31] <wikibugs>	 (03CR) 10Hashar: "Thank you :]" [puppet] - 10https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) (owner: 10Hashar)
[16:47:54] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] admin: remove sudo permissions from gpu-testers and add users to it [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[16:48:52] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] "Nice, yes , agreed this is much better." [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) (owner: 10Elukey)
[16:49:12] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) We may try the wiktionary patch early next week.  The goal with that test is just to see if we get any...
[16:50:03] <wikibugs>	 (03PS2) 10Andrew Bogott: Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630
[16:53:46] <wikibugs>	 (03PS1) 10Elukey: Fix last common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501635 (https://phabricator.wikimedia.org/T148843)
[16:54:33] <wikibugs>	 (03PS3) 10Andrew Bogott: Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630
[16:54:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Fix last common packages for Analytics hosts for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501635 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[16:55:54] <wikibugs>	 (03PS4) 10Andrew Bogott: Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630
[16:57:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Toolforge: remove old trusty-specific indic font packages [puppet] - 10https://gerrit.wikimedia.org/r/501630 (owner: 10Andrew Bogott)
[16:59:09] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: waf: Add dummy data for it [labs/private] - 10https://gerrit.wikimedia.org/r/501636
[16:59:13] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) The long list of patches above was needed to allow to deploy t...
[17:01:32] <icinga-wm>	 RECOVERY - HP RAID on db2044 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK
[17:10:45] <wikibugs>	 (03PS1) 10Thcipriani: Revert "Gerrit 2.15.12 (update core only)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/501638
[17:12:26] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on db2044 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops
[17:12:32] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/includes/diff/TextSlotDiffRenderer.php: Ia326c67de28a4e / T220217 (duration: 01m 00s)
[17:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:36] <stashbot>	 T220217: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217
[17:12:37] <wikibugs>	 (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Revert "Gerrit 2.15.12 (update core only)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/501638 (owner: 10Thcipriani)
[17:12:50] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: waf: Move to httpd::conf instead of httpd::site [puppet] - 10https://gerrit.wikimedia.org/r/501639
[17:12:52] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: waf: Remove realm if guards [puppet] - 10https://gerrit.wikimedia.org/r/501640
[17:16:47] <wikibugs>	 Operations, Analytics, netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (ayounsi) a:ayounsi
[17:17:54] <wikibugs>	 (03CR) 10Herron: [C: 03+1] waf: Add dummy data for it [labs/private] - 10https://gerrit.wikimedia.org/r/501636 (owner: 10Alexandros Kosiaris)
[17:18:02] <wikibugs>	 10Operations, 10Release Pipeline, 10Core Platform Team Kanban (Done with CPT), 10Release-Engineering-Team (Watching / External), 10Services (done): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10mobrovac) FYI, [service-runner v2.6...
[17:19:35] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.23/includes/diff/TextSlotDiffRenderer.php: Ia326c67de28a4e / T220217 (duration: 01m 02s)
[17:19:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:39] <stashbot>	 T220217: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217
[17:21:21] <wikibugs>	 Operations, wikidiff2, Patch-For-Review, Wikimedia-production-error: PHP Warning: wgWikiDiff2MovedParagraphDetectionCutoff is set WikiDiff2 does not support it - https://phabricator.wikimedia.org/T220217 (Krinkle) Open→Resolved a:WMDE-Fisch
[17:23:32] <thcipriani>	 Krinkle: are you in the middle of some backports? I was just about to do a quick gerrit restart.
[17:23:39] <Krinkle>	 thcipriani: done
[17:23:47] <thcipriani>	 k, thanks
[17:24:09] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] waf: Add dummy data for it [labs/private] - 10https://gerrit.wikimedia.org/r/501636 (owner: 10Alexandros Kosiaris)
[17:25:33] <logmsgbot>	 !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@a4e66d4]: Gerrit back to 2.15.11 (on gerrit2001 only)
[17:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey)
[17:25:43] <logmsgbot>	 !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@a4e66d4]: Gerrit back to 2.15.11 (on gerrit2001 only) (duration: 00m 10s)
[17:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:28] <logmsgbot>	 !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@a4e66d4]: Gerrit back to 2.15.11 on cobalt (restart incoming)
[17:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:40] <logmsgbot>	 !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@a4e66d4]: Gerrit back to 2.15.11 on cobalt (restart incoming) (duration: 00m 11s)
[17:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:28] <thcipriani>	 !log restart gerrit
[17:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:13] <thcipriani>	 !log gerrit back on 2.15.11
[17:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:21] <wikibugs>	 (03CR) 10Herron: [C: 03+1] waf: Remove realm if guards [puppet] - 10https://gerrit.wikimedia.org/r/501640 (owner: 10Alexandros Kosiaris)
[17:31:41] <icinga-wm>	 PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[17:33:47] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "looks good to me, just one nitpick" [puppet] - 10https://gerrit.wikimedia.org/r/501639 (owner: 10Alexandros Kosiaris)
[17:34:16] <wikibugs>	 (03CR) 10Mobrovac: [C: 03+1] "Nice catch, Ema!" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[17:35:41] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[17:47:29] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:49:02] <cdanis>	 that is quite a lot
[17:52:22] * greg-g looks at the week: https://grafana.wikimedia.org/d/000000438/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen&from=now-7d&to=now
[17:53:31] <cdanis>	 yeah, you are right greg-g
[17:55:41] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:55:58] <greg-g>	 cdanis: sadly, of course :/
[17:56:38] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Marostegui) 05Open→03Resolved Finished correctly, thanks! ` root@db2044:~# hpssacli controller all show config  Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)      Port Name: 1I     P...
[18:01:25] <icinga-wm>	 RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:02:35] <icinga-wm>	 RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:04:11] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10mobrovac) >>! In T219923#5086476, @Pchelolo wrote: > Apparently `g...
[18:12:34] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 03+1] db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui)
[18:16:51] <wikibugs>	 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac)
[18:24:06] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[18:27:15] <wikibugs>	 (03PS2) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527)
[18:51:33] <wikibugs>	 (03PS2) 10Jcrespo: mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5 [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203)
[18:52:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5 [puppet] - 10https://gerrit.wikimedia.org/r/501555 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[18:53:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10RobH) I'll describe the issue I'm seeing, and how I've troubleshot it, so far to no avail:  cloudvirt1008 will PXE boo...
[18:54:27] <wikibugs>	 (03PS1) 10RobH: cloudvirt1008 dhcp lease file correction [puppet] - 10https://gerrit.wikimedia.org/r/501666 (https://phabricator.wikimedia.org/T216661)
[18:55:58] <wikibugs>	 (03CR) 10RobH: [C: 03+2] cloudvirt1008 dhcp lease file correction [puppet] - 10https://gerrit.wikimedia.org/r/501666 (https://phabricator.wikimedia.org/T216661) (owner: 10RobH)
[18:56:06] <wikibugs>	 (03PS2) 10RobH: cloudvirt1008 dhcp lease file correction [puppet] - 10https://gerrit.wikimedia.org/r/501666 (https://phabricator.wikimedia.org/T216661)
[19:02:36] <wikibugs>	 (03CR) 10Bstorm: "That looks better.  I can deal with an extra line break if things line up right." [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[19:07:04] <wikibugs>	 (03PS3) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527)
[19:09:25] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[19:13:45] <wikibugs>	 (03PS6) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[19:13:47] <wikibugs>	 (03PS4) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203)
[19:18:58] <wikibugs>	 (03CR) 10Bstorm: "Confirmed it works!  :)" [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[19:26:25] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:27:35] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:29:55] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] grafana: update swift dashboard to use new metric names [puppet] - 10https://gerrit.wikimedia.org/r/501399 (https://phabricator.wikimedia.org/T219825) (owner: 10Cwhite)
[19:30:02] <wikibugs>	 (03PS2) 10Cwhite: grafana: update swift dashboard to use new metric names [puppet] - 10https://gerrit.wikimedia.org/r/501399 (https://phabricator.wikimedia.org/T219825)
[19:32:08] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] grafana: remove frack datasources [puppet] - 10https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: 10Filippo Giunchedi)
[19:36:41] <wikibugs>	 (03PS1) 10Bstorm: Revert "labstore: Adapt nfs-exportd to be used on more than one cluster" [puppet] - 10https://gerrit.wikimedia.org/r/501676
[19:38:01] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] Revert "labstore: Adapt nfs-exportd to be used on more than one cluster" [puppet] - 10https://gerrit.wikimedia.org/r/501676 (owner: 10Bstorm)
[19:38:11] <wikibugs>	 (03PS2) 10Bstorm: Revert "labstore: Adapt nfs-exportd to be used on more than one cluster" [puppet] - 10https://gerrit.wikimedia.org/r/501676
[19:38:39] <wikibugs>	 (03CR) 10BryanDavis: "> I don't recall grid exec nodes having public IPs (in terms of the" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis)
[19:44:12] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis)
[19:44:33] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis)
[19:53:12] <wikibugs>	 (03PS1) 10Papaul: DNS: Change production DNS for restbase2019 and restbase2020 [dns] - 10https://gerrit.wikimedia.org/r/501686
[19:53:28] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (Andrew) a:RobH→Andrew
[20:04:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1012 is CRITICAL: connect to address 10.64.20.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:06:09] <wikibugs>	 (03PS1) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[20:06:30] * arturo looking ^^^
[20:22:54] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirts: update nic names for 10Gb [puppet] - 10https://gerrit.wikimedia.org/r/501712 (https://phabricator.wikimedia.org/T216195)
[20:23:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: update nic names for 10Gb [puppet] - 10https://gerrit.wikimedia.org/r/501712 (https://phabricator.wikimedia.org/T216195) (owner: 10Andrew Bogott)
[20:24:04] <wikibugs>	 (03PS2) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[20:34:19] <wikibugs>	 (03PS3) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[20:35:53] <wikibugs>	 (03PS4) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[20:47:02] <wikibugs>	 (03PS5) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[20:47:48] <wikibugs>	 (03PS6) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[20:50:16] <wikibugs>	 (03PS2) 10BryanDavis: dynamicproxy: Prevent STS header from non-TLS connections [puppet] - 10https://gerrit.wikimedia.org/r/499669 (https://phabricator.wikimedia.org/T102367)
[20:52:46] <wikibugs>	 10Operations, 10serviceops, 10Beta-Feature: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (10Jdforrester-WMF) > Alternatively, we could decide to migrate all logged-in users before all the other users.  Do we want to just do this? The only downside I can see is that content exclusively...
[20:55:48] <wikibugs>	 (03PS1) 10Andrew Bogott: site.pp: Make cloudvirt1008 a cloudvirt host [puppet] - 10https://gerrit.wikimedia.org/r/501791
[20:57:56] <wikibugs>	 (03PS7) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[20:58:53] <wikibugs>	 (03PS8) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[21:02:45] <Krenair>	 Anyone else run into ferm trying to run this ip6tables command resulting in that error? https://phabricator.wikimedia.org/P8355
[21:02:46] <Krenair>	 or know what it means
[21:03:41] <wikibugs>	 (03CR) 10BryanDavis: "> Did the previous/old setup work? Why is there a need for a change" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis)
[21:10:43] <wikibugs>	 (03PS9) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[21:12:44] <wikibugs>	 (03PS10) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[21:13:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Andrew)
[21:13:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Andrew) I reimaged and built a canary VM and everything looks good.  Will put into proper service soon.
[21:16:55] <icinga-wm>	 PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:17:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10Andrew) I reimaged and built a canary VM -- the hosted VM cannot access any external networks.  I haven't investigated this more deeply yet,...
[21:18:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Andrew) I reimaged and built a canary VM -- the hosted VM cannot access any external networks.  I haven't investigated this more deeply yet,...
[21:18:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Andrew)
[21:27:01] <wikibugs>	 (03CR) 10Bstorm: "Ok.  This is confirmed working via cherry-pick into toolsbeta now.  It already worked fine on NFS servers, but this version also doesn't b" [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[21:29:59] <hashar>	 !log CI / Zuul is no more processing events / T220243
[21:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:03] <stashbot>	 T220243: CI / Zuul is no more processing events - https://phabricator.wikimedia.org/T220243
[21:37:19] <thcipriani>	 !log restarting gerrit
[21:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:33] <icinga-wm>	 PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org]
[21:41:57] <icinga-wm>	 PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[21:42:35] <icinga-wm>	 PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[21:42:35] <icinga-wm>	 RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[21:42:49] <wikibugs>	 (03CR) 10Legoktm: "Nice. Main thing is that this should probably be built with python3 and not python2" (033 comments) [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/500201 (owner: 10MarkAHershberger)
[21:45:48] <hashar>	 !log thcipriani restarted Gerrit. CI works again # T220243
[21:45:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:52] <stashbot>	 T220243: CI / Zuul is no more processing events - https://phabricator.wikimedia.org/T220243
[21:46:51] <icinga-wm>	 RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[21:47:13] <icinga-wm>	 RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[21:47:51] <icinga-wm>	 RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:48:07] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 3657 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[21:53:58] <wikibugs>	 (03CR) 10Alex Monk: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[21:54:03] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:55:24] <wikibugs>	 (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[21:56:29] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:56:37] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:57:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:57:43] <wikibugs>	 (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[21:58:07] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[21:59:06] <wikibugs>	 (03CR) 10Alex Monk: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[21:59:07] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[22:00:06] <wikibugs>	 (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[22:00:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:04:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:07:05] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:08:15] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:08:45] <wikibugs>	 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 9 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10Addshore) We should see changes in 1.33.0-wmf.24. It looks like the tr...
[22:09:09] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:09:11] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:09:17] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:10:19] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:10:21] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:13:13] <wikibugs>	 (03PS11) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527)
[22:13:31] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] "That is very much my bad, apparently that was missed from this process." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[22:13:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:15:35] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:17:23] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:17:33] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:17:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:18:02] <wikibugs>	 (03CR) 10Bstorm: "I figured out why it worked before.  It's that mtime bit.  The variables are class-level, so they weren't overwritten just yet.  Testing b" [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[22:18:04] <wikibugs>	 (03CR) 10Addshore: [C: 03+1] WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man)
[22:18:17] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:18:39] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:18:39] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:18:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:20:43] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:22:43] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:22:49] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:24:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:26:01] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:26:23] <icinga-wm>	 PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[22:26:31] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:26:32] <wikibugs>	 (03CR) 10Bstorm: "Took my test a step further.  I locally changed the file on the toolsbeta puppetmaster, validated that the client abides by the changes an" [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[22:27:15] <wikibugs>	 (03CR) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[22:27:21] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:27:47] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:27:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:28:31] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:29:05] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:29:09] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 6.128 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:29:24] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt)
[22:29:35] <icinga-wm>	 PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:30:08] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10debt)
[22:30:10] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate mjolnir to stdout/syslog/cee logging output - https://phabricator.wikimedia.org/T218833 (10debt) 05Open→03Resolved
[22:30:21] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:33:47] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:34:59] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:35:01] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:35:17] <icinga-wm>	 RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.206 second response time https://phabricator.wikimedia.org/T174916
[22:36:08] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt)
[22:36:13] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:37:35] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:38:58] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Make spicerack more robust when unfreezing writes to elasticsearch / cirrus - https://phabricator.wikimedia.org/T219640 (10debt) 05Open→03Resolved
[22:39:19] <icinga-wm>	 PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[22:39:55] <icinga-wm>	 PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused https://phabricator.wikimedia.org/T174916
[22:39:58] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Create cookbook to reset frozen write state on elasticsearch / cirrus - https://phabricator.wikimedia.org/T219638 (10debt) 05Open→03Resolved
[22:40:11] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:40:37] <icinga-wm>	 RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.370 second response time https://phabricator.wikimedia.org/T174916
[22:40:47] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:41:11] <icinga-wm>	 RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916
[22:41:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 474.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:41:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 482.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:42:01] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:42:07] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:42:30] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt)
[22:42:47] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:42:51] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:42:51] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:43:15] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:43:19] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:43:19] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 4.116 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:43:55] <wikibugs>	 10Operations, 10CirrusSearch, 10Wikidata, 10Discovery-Search (Current work): Elasticsearch indices went read-only causing huge lag - https://phabricator.wikimedia.org/T219364 (10debt) 05Open→03Resolved
[22:43:57] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:44:01] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:44:01] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:44:05] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:44:31] <icinga-wm>	 PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[22:45:39] <icinga-wm>	 RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916
[22:46:11] <chaomodus>	 !log restarted pdfrender on scb1002 T174916
[22:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:15] <stashbot>	 T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[22:47:02] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt)
[22:47:04] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search (Current work): Elasticsearch 6: silence deprecation warnings to avoid logspam - https://phabricator.wikimedia.org/T219269 (10debt) 05Open→03Resolved
[22:47:19] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:47:53] <wikibugs>	 10Operations, 10Discovery, 10Discovery-Search (Current work), 10Patch-For-Review: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10debt) 05Open→03Resolved
[22:48:03] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:49:15] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:49:17] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:49:40] <wikibugs>	 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Convert check_elasticsearch.py icinga plugin to py3 - https://phabricator.wikimedia.org/T215439 (10debt) 05Open→03Resolved
[22:49:47] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:52:25] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: update elasticsearch curator to 5.6.0 - https://phabricator.wikimedia.org/T218991 (10debt) 05Open→03Resolved
[22:53:09] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:53:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10RobH) a:05Andrew→03Cmjohnson So, per @andrew's request I've investigated the switch stack software for the secondary 'instance' connecti...
[22:53:29] <wikibugs>	 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review, 10User-fgiunchedi: cleanup reprepro configuration for elasticsearch-curator - https://phabricator.wikimedia.org/T216235 (10debt) 05Open→03Resolved
[22:53:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10RobH) a:05Andrew→03Cmjohnson So, per @andrew's request I've investigated the switch stack software for the secondary 'instance' connecti...
[22:53:47] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:54:33] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:55:49] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:56:01] <icinga-wm>	 RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:56:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:56:23] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:57:03] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:57:05] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:57:07] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[22:58:23] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[23:00:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:01:03] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[23:01:21] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[23:02:11] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[23:03:29] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:04:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:07:07] <wikibugs>	 (03CR) 10Jforrester: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[23:07:43] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[23:10:43] <thcipriani>	 !log revert some recent problematic gerrit acl changes
[23:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:23] <icinga-wm>	 PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[23:43:29] <icinga-wm>	 RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 79611 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:48:57] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 11.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:49:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:55:11] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles)