[00:00:51] can be deployed [00:00:53] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I24a5469dbfd0 / T216206 for testwikidatawiki (duration: 00m 50s) [00:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:58] T216206: Set up WikibaseLexemeCirrusSearch extension for Elastic code in WikibaseLexeme - https://phabricator.wikimedia.org/T216206 [00:01:51] SMalyshev: Ok to proceed? [00:01:58] Krinkle: yes [00:02:01] (CR) Krinkle: [C: +2] Enable new Lexeme search on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499868 (owner: Smalyshev) [00:03:17] (Merged) jenkins-bot: Enable new Lexeme search on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499868 (owner: Smalyshev) [00:04:00] SMalyshev: staged on mwdebug1002 [00:04:40] SMalyshev: This may be unrelated but I'm seeing PHP errors from mwdebug1002 [00:04:44] [XJ1gugpAAC4AADrz5YwAAABE] /w/index.php?search=test&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns146=1 ErrorException from line 85 of /srv/mediawiki/php-1.33.0-wmf.23/extensions/ArticlePlaceholder/includes/ItemNotabilityFilter.php: PHP Notice: Undefined index: Q3519023 [00:06:37] I guess this is https://phabricator.wikimedia.org/T207235 [00:06:41] (CR) jenkins-bot: Enable new Lexeme search on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499867 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [00:06:43] (CR) jenkins-bot: Enable new Lexeme search on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499868 (owner: Smalyshev) [00:07:01] hmm [00:07:23] Krinkle: it is search, but the code place is wrong [00:07:33] it's ArticlePlaceholder [00:08:22] Krinkle: are there a lot of them? When did they start? 
[00:08:31] 23:59 [00:08:38] when I synced the previous commit to test [00:08:51] but I'm sure it's just because there was no activity there prior and it's probably in prod as well [00:08:52] checking now [00:09:18] hmm not sure... the fact it's search does make me suspicious but could also be a coincidence [00:09:41] and it says ArticlePlaceholder which has nothing to do with what I'm doing... let me see [00:09:48] Yeah, it's not becoming more common [00:09:58] https://logstash.wikimedia.org/goto/29470c49933fb1a7e4ee3f7b119789c3 [00:10:59] SMalyshev: proceeding to prod? [00:11:08] hmm it does seem to be related to search... [00:11:12] Krinkle: give me a minute [00:11:14] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [00:11:17] Okay, no worries :) [00:12:24] Krinkle: I see same errors before that - days ago [00:13:08] e.g. on 22nd. So I presume it's not something we broke [00:13:35] it also happens on wmf.22 according to Kibana so probably been broken for a while [00:14:05] Yeah [00:14:12] Krinkle: also, it seems to only happen on testwikis [00:14:12] I've left a new trace on the task. 
[00:14:51] which makes me say we can proceed [00:15:11] Okay [00:16:05] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I8887ce013a8 (duration: 00m 51s) [00:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:27] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.23/includes/api/ApiStashEdit.php: I35213d83a0 (duration: 00m 49s) [00:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:43] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:24:41] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 78721 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:24:51] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) @Addshore btw do I understand right that constraints can not be fetched per-revision? In this case, do... [00:34:37] Krinkle: thanks! [00:41:52] (CR) BryanDavis: [C: +1] "Pretty sure this was a typo on my part, yes." 
[puppet] - https://gerrit.wikimedia.org/r/499887 (owner: Andrew Bogott) [02:08:55] RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:14:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:14:43] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:15:17] PROBLEM - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:18:28] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott I can not make icinga shut up about this! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:18:28] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute andrew bogott I can not make icinga shut up about this! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:18:29] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute andrew bogott I can not make icinga shut up about this! 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:18:33] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:19:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:50:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:50:19] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:53:21] PROBLEM - puppet last run on snapshot1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:00:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:00:35] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:05:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:05:41] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:09:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:09:29] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:19:24] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [03:20:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:20:10] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:21:54] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:25:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:25:20] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:45:16] PROBLEM - puppet last run on druid1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:52:24] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:52:41] ACKNOWLEDGEMENT - MD RAID on cp4032 is CRITICAL: connect to address 10.128.0.132 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T219586 [03:52:46] Operations, ops-ulsfo: Degraded RAID on cp4032 - https://phabricator.wikimedia.org/T219586 (ops-monitoring-bot) [03:53:05] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [03:53:09] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:54:27] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:54:37] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:55:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:04:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within 
thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:04:47] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:11:39] RECOVERY - puppet last run on druid1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:12:03] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [04:12:35] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:27:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:27:39] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:28:05] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:31:09] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:31:29] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:32:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:35:03] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:35:35] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:35:53] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:38:53] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:40:05] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:47:27] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:48:35] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:55:47] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:56:39] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:57:27] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:03:59] PROBLEM - puppet last run on db1121 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:13:37] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:13:45] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:14:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:14:43] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:19:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:19:49] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:23:37] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:24:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:25:25] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:27:53] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:28:29] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:28:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:28:45] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:30:17] RECOVERY - puppet last run on db1121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:34:19] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:41:57] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:42:39] (PS1) Marostegui: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/499970 [05:44:13] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/499970 (owner: Marostegui) [05:45:36] (Merged) jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/499970 (owner: Marostegui) [05:46:05] (CR) jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - https://gerrit.wikimedia.org/r/499970 (owner: Marostegui) [05:46:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1075 (duration: 00m 52s) [05:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:09] !log Remove labsdb1004 and labsdb1005 from tendril - T216749 [05:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:12] T216749: Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 [05:49:26] !log Disable notifications on labsdb1004 and labsdb1005 - T216749 [05:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:35] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:59:00] Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (Marostegui) @Bstorm I have removed the hosts from Tendril (ten... [06:01:25] Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (Marostegui) [06:14:19] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - https://gerrit.wikimedia.org/r/499971 [06:14:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:15:05] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:15:28] (PS10) Vgutierrez: acme_chief: Provide OCSP stapling support [puppet] - https://gerrit.wikimedia.org/r/499746 (https://phabricator.wikimedia.org/T213705) [06:15:30] (PS14) Vgutierrez: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk) [06:15:32] (PS7) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [06:15:34] (PS2) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [06:15:36] (PS4) Vgutierrez: cache: 
serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [06:15:58] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - https://gerrit.wikimedia.org/r/499971 (owner: Marostegui) [06:18:45] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - https://gerrit.wikimedia.org/r/499971 (owner: Marostegui) [06:19:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1075 (duration: 00m 50s) [06:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:17] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:28:03] (PS1) Marostegui: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/499972 [06:29:31] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/499972 (owner: Marostegui) [06:29:39] PROBLEM - puppet last run on cloudvirt1029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity] [06:29:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - https://gerrit.wikimedia.org/r/499971 (owner: Marostegui) [06:30:28] (Merged) jenkins-bot: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/499972 (owner: Marostegui) [06:30:41] (CR) jenkins-bot: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/499972 (owner: Marostegui) [06:31:23] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:32:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1009 (duration: 00m 50s) [06:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:51] (CR) Vgutierrez: "pcc looks good on existing acme-chief clients: https://puppet-compiler.wmflabs.org/compiler1002/15427/" [puppet] - https://gerrit.wikimedia.org/r/499746 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez) [06:39:39] PROBLEM - MariaDB Slave IO: pc3 on pc2009 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1009.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1009.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:40:00] ^ that is me [06:41:03] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:41:04] !log Upgrade pc1009 [06:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/499973 [06:42:56] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - https://gerrit.wikimedia.org/r/499973 (owner: Marostegui) [06:43:25] RECOVERY - MariaDB Slave IO: pc3 on pc2009 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:43:53] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - https://gerrit.wikimedia.org/r/499973 (owner: Marostegui) [06:44:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1009 (duration: 00m 49s) [06:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:23] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:48:25] (PS2) Gilles: Element Timing for Images and Layout Stability on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/499152 (https://phabricator.wikimedia.org/T216598) [06:49:08] (PS11) Vgutierrez: acme_chief: Provide OCSP stapling support [puppet] - https://gerrit.wikimedia.org/r/499746 (https://phabricator.wikimedia.org/T213705) [06:49:10] (PS15) Vgutierrez: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk) [06:49:13] (PS8) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [06:49:15] (PS3) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [06:49:17] (PS5) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 
https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [06:49:19] (PS1) Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) [06:49:21] (PS1) Vgutierrez: hieradata: Deploy acme-chief unified certificate in the cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [06:50:44] (PS2) Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [06:50:46] (PS9) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [06:50:48] (PS4) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [06:50:50] (PS6) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [06:51:57] (CR) Gilles: [C: +2] Element Timing for Images and Layout Stability on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/499152 (https://phabricator.wikimedia.org/T216598) (owner: Gilles) [06:52:00] (CR) jenkins-bot: Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - https://gerrit.wikimedia.org/r/499973 (owner: Marostegui) [06:55:57] RECOVERY - puppet last run on cloudvirt1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:13] !log Remove tools section from tendril by doing: update shards set display='0' where name='tools'; T216749 [06:56:13] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:18] T216749: Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 [06:57:33] (Merged) jenkins-bot: Element Timing for Images and Layout Stability on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/499152 (https://phabricator.wikimedia.org/T216598) (owner: Gilles) [06:58:11] (PS1) Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/499977 [06:58:55] (CR) Vgutierrez: "pcc is happy and shows almost a NOOP (just setting acme_chief => False) in all DCs for text and upload cache servers: https://puppet-compi" [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk) [07:01:07] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216598 T216594 Element Timing for Images and Layout Stability on ruwiki (duration: 00m 51s) [07:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:17] T216594: Layout Stability API origin trial - https://phabricator.wikimedia.org/T216594 [07:01:18] T216598: Element Timing for Images origin trial - https://phabricator.wikimedia.org/T216598 [07:03:22] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/499977 (owner: Marostegui) [07:04:34] (Merged) jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/499977 (owner: Marostegui) [07:06:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1110 (duration: 00m 49s) [07:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:11] (CR) Vgutierrez: "pcc shows unified being deployed in cp1008 and the 
upload/text servers being unaffected: https://puppet-compiler.wmflabs.org/compiler1002/" [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez) [07:06:21] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:06:29] !log Upgrade db1110 [07:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:14] (PS1) Marostegui: db-eqiad.php: Slowly repool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/499978 [07:10:42] (PS16) Vgutierrez: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk) [07:10:45] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:10:55] (PS2) Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) [07:10:57] (PS3) Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [07:10:59] (PS10) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [07:11:01] (PS5) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [07:11:03] (PS7) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [07:11:59] PROBLEM - kubelet 
operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:12:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:14:19] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499978 (owner: 10Marostegui) [07:15:43] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499978 (owner: 10Marostegui) [07:16:07] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:16:15] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Provide OCSP stapling support [puppet] - 10https://gerrit.wikimedia.org/r/499746 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [07:17:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1110 (duration: 01m 06s) [07:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:38] !log reenabling puppet in acme-chief clients after verifying NOOP in netmon2001 [07:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:41] (03PS1) 10Marostegui: db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499979 [07:22:29] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:23:17] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:26:16] (03CR) 10jenkins-bot: Element Timing for 
Images and Layout Stability on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499152 (https://phabricator.wikimedia.org/T216598) (owner: 10Gilles) [07:26:18] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499979 (owner: 10Marostegui) [07:26:23] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499977 (owner: 10Marostegui) [07:27:19] (03Merged) 10jenkins-bot: db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499979 (owner: 10Marostegui) [07:28:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1110 (duration: 00m 50s) [07:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:25] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10jcrespo) Adding release engineering, although they should not own this, but so th... 
[07:30:37] (03PS1) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [07:30:49] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:31:57] (03CR) 10Vgutierrez: [C: 04-2] "Do not merge till I222b8ef48bf0ca2b23c091ebafd2bb933a9faa99 has been tested thoroughly" [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [07:34:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:34:27] (03PS1) 10Marostegui: db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499982 [07:34:37] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [07:34:40] (03CR) 10Vgutierrez: [C: 04-2] "pcc shows the expected NOOPs in the upload cluster and the proper changes in text: https://puppet-compiler.wmflabs.org/compiler1002/15430/" [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [07:36:18] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499978 (owner: 10Marostegui) [07:36:35] (03CR) 10jenkins-bot: db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499979 (owner: 10Marostegui) [07:38:25] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [07:39:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:43:35] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [07:44:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:45:02] (03CR) 10Elukey: [C: 03+1] Enable base::service_auto_restart for rsync/namenode standby [puppet] - 10https://gerrit.wikimedia.org/r/498834 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:48:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:48:39] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [07:51:21] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:51:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499982 (owner: 10Marostegui) [07:52:59] (03Merged) 10jenkins-bot: db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499982 (owner: 10Marostegui) [07:54:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1110 (duration: 00m 51s) [07:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:35] (03CR) 10jenkins-bot: db-eqiad.php: More weight to db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499982 (owner: 10Marostegui) [07:58:42] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.23/includes/media/MediaTransformOutput.php: T216499 Only apply high priority half the time (duration: 00m 50s) [07:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:46] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499 [08:00:04] (03PS8) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [08:00:06] (03PS2) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [08:05:32] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for rsync/namenode standby [puppet] - 10https://gerrit.wikimedia.org/r/498834 (https://phabricator.wikimedia.org/T135991) [08:07:33] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for rsync/namenode standby [puppet] - 10https://gerrit.wikimedia.org/r/498834 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:10:00] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1110 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/499984 [08:15:21] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:17:35] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:17:58] (03PS34) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [08:19:10] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:22:39] (03PS2) 10Arturo Borrero Gonzalez: wmcs-spreadcheck: return 0 on success [puppet] - 10https://gerrit.wikimedia.org/r/499887 (owner: 10Andrew Bogott) [08:23:57] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:23:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs-spreadcheck: return 0 on success [puppet] - 10https://gerrit.wikimedia.org/r/499887 (owner: 10Andrew Bogott) [08:25:12] (03PS35) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [08:27:41] 10Operations, 10Traffic, 10netops: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (10elukey) p:05Triage→03High [08:28:10] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [08:28:12] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499984 (owner: 10Marostegui) [08:28:16] 10Operations, 
10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10MoritzMuehlenhoff) >>! In T215562#5066711, @Krinkle wrote: > As such, it is effec... [08:28:49] (03PS2) 10Arturo Borrero Gonzalez: Stop serving trusty repositories in aptly [puppet] - 10https://gerrit.wikimedia.org/r/499935 (owner: 10Muehlenhoff) [08:29:36] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499984 (owner: 10Marostegui) [08:30:33] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499984 (owner: 10Marostegui) [08:30:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1110 (duration: 00m 50s) [08:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "By the time of this patch, we only have 2 trusty servers in toolforge: toolscheker. And we are actively working on removing them." [puppet] - 10https://gerrit.wikimedia.org/r/499935 (owner: 10Muehlenhoff) [08:34:13] (03PS1) 10Filippo Giunchedi: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/499987 (https://phabricator.wikimedia.org/T219591) [08:34:18] elukey: ^ [08:35:03] (03CR) 10Elukey: [C: 03+1] Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/499987 (https://phabricator.wikimedia.org/T219591) (owner: 10Filippo Giunchedi) [08:35:11] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Addshore) >>! In T217897#5066900, @Smalyshev wrote: >> WDQS does know what the latest version of the entity that... 
[08:35:40] (03CR) 10Filippo Giunchedi: [C: 03+2] Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/499987 (https://phabricator.wikimedia.org/T219591) (owner: 10Filippo Giunchedi) [08:36:20] (03CR) 10Arturo Borrero Gonzalez: "I'm not sure about this one. Leaving it for Andrew to decide if we can merge this now or worth waiting a couple of days until all Trusty i" [puppet] - 10https://gerrit.wikimedia.org/r/499933 (owner: 10Muehlenhoff) [08:36:42] !log depool ulsfo as precaution -- link repair in progress [08:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:17] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:42:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:42:27] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:44:39] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 52.32 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:48:12] ah! 
Just in time :) [08:50:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:43] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:53:15] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:53:17] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:54:19] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:59:41] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:02:15] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:03:32] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov2002.codfw.wmnet'] ` The... 
[09:05:03] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=nginx [09:05:04] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=varnish-fe [09:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:39] Hey ops, I have a last-minute request to temporary lift of IP cap on fr.wikipedia.org https://phabricator.wikimedia.org/T219594 [09:05:40] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2005.codfw.wmnet,service=nginx [09:05:41] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2005.codfw.wmnet,service=varnish-fe [09:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:56] Last minute as-in I just get the IP for an event starting in 3 hours. 
[09:08:41] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:08:41] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:59] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:30] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:10:37] (03PS1) 10Arturo Borrero Gonzalez: openstack: add mitaka/stretch support for neutron server in cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/499992 (https://phabricator.wikimedia.org/T215407) [09:12:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:15:35] (03PS2) 10Dzahn: xvfb: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499771 [09:16:15] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:16:29] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:16:44] @seen hashar [09:16:44] mutante: Last time I saw hashar they were quitting the network with reason: Quit: I am a virus. Please copy paste me in your /quit message to help me propagate N/A at 3/28/2019 2:56:32 PM (18h20m12s ago) [09:18:02] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (10ema) [09:21:27] (03PS1) 10Dzahn: xvfb: replace base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) [09:21:53] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:22:44] (03CR) 10Mathew.onipe: "PCC output is Ok. Changes are expected: https://puppet-compiler.wmflabs.org/compiler1002/15432/" [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [09:23:43] re: icinga alert on operational latencies on kubernetes1003 - looking at the actual graph does not look like it and 1003 is not the slowest ? 
[09:24:45] (03PS2) 10Arturo Borrero Gonzalez: openstack: add mitaka/stretch support for neutron server in cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/499992 (https://phabricator.wikimedia.org/T215407) [09:27:21] jouncebot: next [09:27:21] In 73 hour(s) and 2 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1030) [09:27:35] #wikimedia-tech [09:27:57] nevermind, somebody was asking for a deploy there. but it's Friday [09:31:10] (03PS1) 10Aklapper: Add throttling rule for frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499994 [09:32:42] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) a:05jcrespo→03Papaul @papaul we need help from you. We cannot network boot on dbprov2002 (we did on dbprov2001 already). `... [09:32:56] (03PS2) 10Aklapper: Add throttling rule for frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499994 (https://phabricator.wikimedia.org/T219594) [09:33:39] (03PS4) 10Ladsgroup: ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) [09:34:09] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:35:37] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Summary of what each team is currently wo... 
[09:36:37] (03PS36) 10Gehel: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [09:37:31] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/499992 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [09:37:37] !log restarting zuul on contint1001 [09:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: add mitaka/stretch support for neutron server in cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/499992 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [09:43:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc https://puppet-compiler.wmflabs.org/compiler1002/15435/" [puppet] - 10https://gerrit.wikimedia.org/r/499992 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [09:43:13] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 70.56 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:51:23] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Elasticsearch: Create checks that alerts on cirrussearch update lags - https://phabricator.wikimedia.org/T219601 (10Mathew.onipe) [09:51:57] (03CR) 10Hashar: [C: 03+1] "At least for CI, we are no more using Xvfb." 
[puppet] - 10https://gerrit.wikimedia.org/r/499771 (owner: 10Dzahn) [09:58:18] (03CR) 10Dzahn: [C: 03+2] xvfb: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499771 (owner: 10Dzahn) [09:58:31] (03PS3) 10Dzahn: xvfb: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499771 [09:58:59] (03PS5) 10Ladsgroup: ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) [10:00:25] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:01:05] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:01:27] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:03:32] 10Operations, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) [10:06:07] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:11:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:11:25] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:13:59] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:15:09] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:16:51] (03PS1) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [10:17:59] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:18:28] (03PS6) 10Ladsgroup: ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) [10:20:13] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:20:15] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:21:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:22:02] (03CR) 10Ladsgroup: "Now it's ready:" [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) (owner: 10Ladsgroup) [10:24:51] (03PS2) 10Dzahn: xvfb: replace base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) [10:24:55] (03PS2) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [10:26:21] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:27:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:27:51] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:30:31] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:31:49] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:33:06] (03PS3) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [10:35:02] briefly pooling cp2002's varnish-fe again to try reproduce the 503s we got earlier [10:35:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:35:27] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=nginx [10:35:28] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-fe [10:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:29] (03CR) 10Jcrespo: "Not sure about this..." [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:06] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=nginx [10:36:07] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=varnish-fe [10:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:27] (03PS4) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [10:37:39] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:37:47] (03PS1) 10Arturo Borrero 
Gonzalez: openstack: serverpackages: mitaka: stretch: ignore python-cryptography from bpo [puppet] - 10https://gerrit.wikimedia.org/r/499998 [10:38:50] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: mitaka: stretch: ignore python-cryptography from bpo [puppet] - 10https://gerrit.wikimedia.org/r/499998 (owner: 10Arturo Borrero Gonzalez) [10:40:11] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:40:51] (03PS2) 10Arturo Borrero Gonzalez: openstack: serverpackages: mitaka: stretch: ignore python-cryptography from bpo [puppet] - 10https://gerrit.wikimedia.org/r/499998 [10:41:03] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/499998 (owner: 10Arturo Borrero Gonzalez) [10:41:17] (03CR) 10Marostegui: "Agreed it is unclear, but let's try to see if it works so at least we know whether we have the ability to produce remote dumps or not." 
[puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:41:39] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: mitaka: stretch: ignore python-cryptography from bpo [puppet] - 10https://gerrit.wikimedia.org/r/499998 (owner: 10Arturo Borrero Gonzalez) [10:42:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:43:00] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: mitaka: stretch: ignore python-cryptography from bpo [puppet] - 10https://gerrit.wikimedia.org/r/499998 (owner: 10Arturo Borrero Gonzalez) [10:43:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:46:50] (03PS1) 10Ladsgroup: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) [10:46:59] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:47:52] (03CR) 10jerkins-bot: [V: 04-1] Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup) [10:47:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:50:03] (03PS2) 10Ladsgroup: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) [10:50:30] (03PS3) 10Arturo Borrero Gonzalez: openstack: serverpackages: mitaka: stretch: ignore python-cryptography from bpo [puppet] - 10https://gerrit.wikimedia.org/r/499998 [10:53:06] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/499998 (owner: 10Arturo Borrero Gonzalez) [10:53:59] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:54:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: mitaka: stretch: ignore python-cryptography from bpo [puppet] - 10https://gerrit.wikimedia.org/r/499998 (owner: 10Arturo Borrero Gonzalez) [11:00:33] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:03:57] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:06:13] 10Operations, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10MoritzMuehlenhoff) >>! In T219279#5068956, @Joe wrote: > @Anomie so you're suggesting we need to complete the... [11:07:17] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:09:17] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:10:19] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:10:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add throttling rule for frwiki event (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499994 (https://phabricator.wikimedia.org/T219594) (owner: 10Aklapper) [11:11:01] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:12:09] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:14:05] RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:16:55] (03Abandoned) 10Lucas Werkmeister (WMDE): Add throttling rule for frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499994 (https://phabricator.wikimedia.org/T219594) (owner: 10Aklapper) [11:17:42] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: remove references to tideways [puppet] - 10https://gerrit.wikimedia.org/r/499144 [11:19:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "Not sure about changes to fixtures and spec but LGTM in principle" [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:25:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: remove references to tideways [puppet] - 10https://gerrit.wikimedia.org/r/499144 (owner: 10Giuseppe Lavagetto) [11:29:39] (03PS1) 10Arturo Borrero Gonzalez: openstack: admin_scripts: add missing directory dependency for root ssh key [puppet] - 10https://gerrit.wikimedia.org/r/500002 
[11:30:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] docker: Remove support for trusty images [puppet] - 10https://gerrit.wikimedia.org/r/499929 (owner: 10Muehlenhoff) [11:31:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: admin_scripts: add missing directory dependency for root ssh key [puppet] - 10https://gerrit.wikimedia.org/r/500002 (owner: 10Arturo Borrero Gonzalez) [11:32:49] PROBLEM - puppet last run on an-worker1078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:35:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "This is great, thanks a lot. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [11:35:37] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:36:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: admin_script: delete code used for renaming cleanup [puppet] - 10https://gerrit.wikimedia.org/r/500003 [11:36:18] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: raise mysql.connect_timeout to 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488) [11:39:53] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:40:07] <_joe_> I did notice [11:40:23] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [11:42:50] gerrit? [11:42:54] ah [11:44:11] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] [11:45:22] !log cobalt - systemctl restart gerrit [11:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:47] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::php: raise mysql.connect_timeout to 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [11:47:29] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 950 bytes in 0.090 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:47:42] gerrit back for me [11:47:59] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27453 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [11:48:39] same [11:48:49] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [11:49:31] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [11:49:53] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [11:51:07] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:51:39] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [11:51:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: admin_script: delete code used for renaming cleanup [puppet] - 10https://gerrit.wikimedia.org/r/500003 (owner: 10Arturo Borrero Gonzalez) [11:51:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15445/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [11:52:07] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [11:52:07] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 6 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [11:52:19] (03PS13) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [11:52:21] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 5 minutes ago with 7 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [11:52:27] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::php: raise mysql.connect_timeout to 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488) [11:53:07] (03CR) 10Jbond: [C: 03+2] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [11:53:13] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [11:53:29] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [11:53:37] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:53:49] (03CR) 10Alex Monk: "What's the deal with those hosts under 'Hosts that fail to compile when the change is applied'? do they not work in puppet-compiler?" [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [11:53:51] runs puppet on some of those [11:53:59] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [11:54:17] running puppet on cumin worked fine [11:54:18] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::php: raise mysql.connect_timeout to 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488) [11:54:29] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [11:54:43] however i did get a 503 from gerrit a second ago so wonder if it was having issues, most the errors seem to be related to pulling [11:54:44] jbond42: it's fallout from gerrit crash. they are all expected to recover [11:54:53] ok cool :) [11:55:20] i was just going to run it on a few to get the recoveries [11:55:39] ack [11:57:21] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:57:21] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:58:45] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:58:52] (03PS1) 10Elukey: role::piwik: simplify profile's parameters and remove dead code [puppet] - 10https://gerrit.wikimedia.org/r/500007 (https://phabricator.wikimedia.org/T218037) [11:59:17] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:01:30] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [12:03:03] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15446/ - no op for matomo1001" [puppet] - 10https://gerrit.wikimedia.org/r/500007 (https://phabricator.wikimedia.org/T218037) (owner: 10Elukey) [12:03:57] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:04:06] (03PS1) 10Ema: ATS: unset AE:gzip [puppet] - 10https://gerrit.wikimedia.org/r/500011 (https://phabricator.wikimedia.org/T125938) [12:04:27] RECOVERY - puppet last run on an-worker1078 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:05:13] (03PS2) 10Ema: ATS: unset Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/500011 (https://phabricator.wikimedia.org/T125938) [12:05:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:05:59] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:06:05] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:07:08] 10Operations, 10Analytics, 10EventBus, 10vm-requests, 10Services (watching): Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Pchelolo) [12:07:26] (03CR) 10Dzahn: "hashar: fyi and re: puppet cleanup." 
[puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [12:07:38] 10Operations, 10Analytics, 10EventBus, 10vm-requests, and 2 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Pchelolo) [12:08:59] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: drop jessie-backport install option [puppet] - 10https://gerrit.wikimedia.org/r/500013 (https://phabricator.wikimedia.org/T216497) [12:09:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:09:53] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:10:33] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:12:27] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: drop jessie-backport install option [puppet] - 10https://gerrit.wikimedia.org/r/500013 (https://phabricator.wikimedia.org/T216497) [12:12:31] (03CR) 10Ema: [C: 03+2] ATS: unset Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/500011 (https://phabricator.wikimedia.org/T125938) (owner: 10Ema) [12:12:43] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:13:09] (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: drop jessie-backport install option [puppet] - 10https://gerrit.wikimedia.org/r/500013 (https://phabricator.wikimedia.org/T216497) [12:13:14] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: drop jessie-backport install option [puppet] - 10https://gerrit.wikimedia.org/r/500013 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero 
Gonzalez) [12:13:52] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: drop jessie-backport install option [puppet] - 10https://gerrit.wikimedia.org/r/500013 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [12:14:07] (03PS4) 10Arturo Borrero Gonzalez: openstack: keystone: drop jessie-backport install option [puppet] - 10https://gerrit.wikimedia.org/r/500013 (https://phabricator.wikimedia.org/T216497) [12:14:17] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:15:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: keystone: drop jessie-backport install option [puppet] - 10https://gerrit.wikimedia.org/r/500013 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [12:15:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Expose rsyslog_udp_port to services configs. [puppet] - 10https://gerrit.wikimedia.org/r/498872 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [12:15:51] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:16:13] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:16:23] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:16:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add rsyslog kafka to service nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [12:18:39] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:49] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:23:52] !log rolling ATS restarts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500011/ T213263 [12:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:57] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [12:25:33] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: mitaka: stretch: use python-pyldap instead of python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/500014 (https://phabricator.wikimedia.org/T215407) [12:26:54] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: mitaka: stretch: use python-pyldap instead of python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/500014 (https://phabricator.wikimedia.org/T215407) [12:27:41] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:28:25] (03PS9) 10Jbond: jessie-backports: warn users if they try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) [12:28:41] (03CR) 10Dzahn: "@integration-slave-jessie-1001 has an unrelated puppet issue: php-xdebug : PreDepends: php-common (>= 2:69~) but 1:51~bpo8+1+wmf1 is to b" [puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [12:28:45] (03CR) 10Jbond: "comments addresses" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) 
[12:29:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: keystone: mitaka: stretch: use python-pyldap instead of python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/500014 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [12:32:33] If anyone has any details that could help upstream here https://groups.google.com/forum/m/#!topic/repo-discuss/pBMh09-XJsw with the gerrit issue please add :) [12:33:13] (03PS1) 10Thcipriani: Revert "Gerrit 2.15.12 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/500016 [12:33:59] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:34:32] (03CR) 10Paladox: [V: 03+2 C: 03+2] Revert "Gerrit 2.15.12 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/500016 (owner: 10Thcipriani) [12:35:23] (03CR) 10Paladox: [C: 03+2] Revert "Gerrit 2.15.12 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/500016 (owner: 10Thcipriani) [12:36:35] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:37:13] (03CR) 10Thcipriani: [V: 03+2] Revert "Gerrit 2.15.12 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/500016 (owner: 10Thcipriani) [12:40:23] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:41:15] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:42:01] (03PS1) 10Thcipriani: Revert "Revert "gerrit: Disable jgit gc"" [puppet] - 10https://gerrit.wikimedia.org/r/500017 [12:43:37] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "gerrit: Disable jgit gc"" [puppet] - 10https://gerrit.wikimedia.org/r/500017 (owner: 10Thcipriani) [12:46:48] !log upgrading snapshot1005-1007/1009 to component/php72 (T218193) [12:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:52] T218193: Switch dumps to component/php7.2 - https://phabricator.wikimedia.org/T218193 [12:46:59] !log upgrading snapshot1008 to component/php72 (T218193) [12:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:45] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@670ddb8]: Gerrit (back) to version 2.15.11 (gerrit2001 only) [12:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:55] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@670ddb8]: Gerrit (back) to version 2.15.11 (gerrit2001 only) (duration: 00m 10s) [12:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:15] !log removing php 7.0 packages from snapshot1008, dumps are only using 7.2 (T218193) [12:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:42] 10Operations, 10puppet-compiler, 10Continuous-Integration-Config, 10Release-Engineering-Team (Kanban): operations-puppet-catalog-compiler-test fails due to commit message validator linter error - https://phabricator.wikimedia.org/T219615 (10hashar) [12:52:45] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@670ddb8]: Gerrit (back) to version 2.15.11 on cobalt -- restart of gerrit incoming [12:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:56] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@670ddb8]: Gerrit (back) to version 2.15.11 on cobalt 
-- restart of gerrit incoming (duration: 00m 11s) [12:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:40] !log restarting gerrit to finish rollback to 2.15.11 [12:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:24] !log gerrit running on 2.15.11 [12:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:50] !tzag is Time Zone Appropriate Greeting - https://www.urbandictionary.com/define.php?term=TZAG [12:55:51] Key was added [13:04:00] (03PS4) 10ArielGlenn: use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) [13:04:17] (03CR) 10jerkins-bot: [V: 04-1] use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) (owner: 10ArielGlenn) [13:05:56] !log cp2002/cp2005: repool varnish-fe for user traffic T213263 [13:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:59] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [13:06:04] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=nginx [13:06:05] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-fe [13:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:30] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2005.codfw.wmnet,service=nginx [13:06:31] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2005.codfw.wmnet,service=varnish-fe [13:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:35] (03PS5) 10ArielGlenn: use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) [13:07:35] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:10:52] (03PS1) 10Ema: Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/500031 (https://phabricator.wikimedia.org/T219591) [13:12:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:15:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:18:29] (03PS1) 10Jcrespo: transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) [13:18:51] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [13:20:32] (03PS2) 10Jcrespo: transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) [13:20:59] (03CR) 10Filippo Giunchedi: "AFAICS link maintenance will be performed tonight, I'm not opposed to repooling ulsfo but we should make sure the link under maintenance i" [dns] - 10https://gerrit.wikimedia.org/r/500031 (https://phabricator.wikimedia.org/T219591) (owner: 10Ema) [13:21:10] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Allow for a 3rd 
transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [13:23:22] (03PS1) 10Arturo Borrero Gonzalez: wmcs: introduce codfw1dev puppet code [puppet] - 10https://gerrit.wikimedia.org/r/500044 (https://phabricator.wikimedia.org/T219626) [13:23:32] (03CR) 10Filippo Giunchedi: "LGTM, see last inline comment re: ARCHIVE vs ARCHIVE_BACKPORTS" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [13:29:44] (03PS10) 10Jbond: jessie-backports: warn users if they try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) [13:30:07] (03CR) 10Jbond: jessie-backports: warn users if they try to use backports on jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [13:31:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:34:32] (03CR) 10Muehlenhoff: [C: 03+1] jessie-backports: warn users if they try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [13:34:50] (03CR) 10Filippo Giunchedi: [C: 03+1] jessie-backports: warn users if they try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [13:35:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:41:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: introduce codfw1dev puppet code [puppet] - 10https://gerrit.wikimedia.org/r/500044 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [13:43:54] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: assign proper role [puppet] - 10https://gerrit.wikimedia.org/r/500047 (https://phabricator.wikimedia.org/T219626) [13:46:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: assign proper role [puppet] - 10https://gerrit.wikimedia.org/r/500047 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [13:50:31] 10Operations, 10media-storage: swift falsely claims 404s are gzipped - https://phabricator.wikimedia.org/T219635 (10ema) [13:50:50] 10Operations, 10media-storage: swift falsely claims 404s are gzipped - https://phabricator.wikimedia.org/T219635 (10ema) p:05Triage→03Normal [13:53:54] (03CR) 10Jbond: [C: 03+2] jessie-backports: warn users if they try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [13:54:03] (03PS11) 10Jbond: jessie-backports: warn users if they try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) [13:54:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:59:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:03:42] (03PS1) 10Mholloway: Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 [14:07:15] (03CR) 10Volans: admin: allow users to be removed preserving their home directories (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey) [14:16:19] (03CR) 10Marostegui: "that will be a snapshot that has already being prepared on source, right?" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [14:22:31] PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:30:11] (03PS3) 10Dzahn: xvfb: replace base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) [14:35:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] docker: Remove support for trusty images [puppet] - 10https://gerrit.wikimedia.org/r/499929 (owner: 10Muehlenhoff) [14:35:34] (03PS2) 10Alexandros Kosiaris: docker: Remove support for trusty images [puppet] - 10https://gerrit.wikimedia.org/r/499929 (owner: 10Muehlenhoff) [14:35:39] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] docker: Remove support for trusty images [puppet] - 10https://gerrit.wikimedia.org/r/499929 (owner: 10Muehlenhoff) [14:42:25] (03PS4) 10Dzahn: xvfb: replace base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) [14:44:59] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: remove backports from jessie [puppet] - 10https://gerrit.wikimedia.org/r/500054 (https://phabricator.wikimedia.org/T219580) [14:47:41] (03CR) 10Dzahn: [C: 03+2] xvfb: replace base::service_unit with systemd::service 
[puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [14:48:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Indeed, entirely unused branching. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/499944 (owner: 10Alex Monk) [14:48:49] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: remove backports from jessie [puppet] - 10https://gerrit.wikimedia.org/r/500054 (https://phabricator.wikimedia.org/T219580) [14:48:52] (03PS2) 10Alexandros Kosiaris: base::firewall: rm seemingly unused realm branch [puppet] - 10https://gerrit.wikimedia.org/r/499944 (owner: 10Alex Monk) [14:54:05] RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:56:10] (03CR) 10Dzahn: "thanks for the follow-up on mediawiki params, Krenair!" [puppet] - 10https://gerrit.wikimedia.org/r/499567 (owner: 10Alex Monk) [14:56:57] PROBLEM - Disk space on ldap-eqiad-replica02 is CRITICAL: DISK CRITICAL - free space: / 676 MB (3% inode=96%) [15:00:07] !log ldap-eqiad-replica02 - running out of disk - apt-get clean - gzipping /var/log/debug [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:40] 10Operations, 10Traffic: Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) There's some complexities here that I've been stewing on for a while, mostly noted in the original description, but I like this general direction. Most of... [15:04:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500054 (https://phabricator.wikimedia.org/T219580) (owner: 10Giuseppe Lavagetto) [15:05:22] <_joe_> thanks! 
[15:05:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: remove backports from jessie [puppet] - 10https://gerrit.wikimedia.org/r/500054 (https://phabricator.wikimedia.org/T219580) (owner: 10Giuseppe Lavagetto) [15:05:45] (03PS3) 10Giuseppe Lavagetto: docker::baseimages: remove backports from jessie [puppet] - 10https://gerrit.wikimedia.org/r/500054 (https://phabricator.wikimedia.org/T219580) [15:13:27] RECOVERY - Disk space on ldap-eqiad-replica02 is OK: DISK OK [15:14:39] <_joe_> !log pruning old images and containers on boron [15:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:48] 10Operations, 10Discovery-Search (Current work), 10Wikimedia-Incident: Create cookbook to reset frozen write state on elasticsearch / cirrus - https://phabricator.wikimedia.org/T219638 (10Gehel) [15:22:25] (03PS1) 10Gehel: Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) [15:23:02] (03PS2) 10Gehel: Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) [15:23:20] (03CR) 10Jcrespo: "> that will be a snapshot that has already being prepared on source," [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [15:24:06] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Incident: Create checks that alerts on cirrussearch update lags - https://phabricator.wikimedia.org/T219601 (10Gehel) [15:24:17] (03CR) 10Marostegui: "As per https://phabricator.wikimedia.org/T219631#5069579 that is this very same code?" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [15:25:05] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to reset frozen writes on elasticsearch / cirrus. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: 10Gehel) [15:25:31] 10Operations, 10Discovery-Search (Current work), 10Wikimedia-Incident: Make spicerack more robust when unfreezing writes to elasticsearch / cirrus - https://phabricator.wikimedia.org/T219640 (10Gehel) [15:25:49] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Create cookbook to reset frozen write state on elasticsearch / cirrus - https://phabricator.wikimedia.org/T219638 (10Gehel) a:03Gehel [15:25:55] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Create cookbook to reset frozen write state on elasticsearch / cirrus - https://phabricator.wikimedia.org/T219638 (10Gehel) p:05Triage→03High [15:26:01] 10Operations, 10Discovery-Search (Current work), 10Wikimedia-Incident: Make spicerack more robust when unfreezing writes to elasticsearch / cirrus - https://phabricator.wikimedia.org/T219640 (10Gehel) p:05Triage→03High a:03Gehel [15:26:51] (03PS3) 10Gehel: Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) [15:28:32] (03CR) 10Jcrespo: "> As per https://phabricator.wikimedia.org/T219631#5069579 that is" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [15:30:06] (03CR) 10Marostegui: "Yeah sure, just asking if it was already functional, as in, it was already tested that works :)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [15:30:09] (03PS1) 10Muehlenhoff: Pull in kibana/logstash 5.6.15 [puppet] - 10https://gerrit.wikimedia.org/r/500066 [15:30:13] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:32:16] (03PS1) 10Gehel: Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) [15:34:40] (03PS6) 10ArielGlenn: use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) [15:35:34] (03PS4) 10CRusnov: Add synchronizing nodes to ganeti-netbox sync. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 [15:35:47] (03CR) 10CRusnov: Add synchronizing nodes to ganeti-netbox sync. (036 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 (owner: 10CRusnov) [15:36:57] (03PS5) 10CRusnov: netbox ganeti sync: Fix path to logfiles. [puppet] - 10https://gerrit.wikimedia.org/r/499288 [15:38:19] (03CR) 10CRusnov: [C: 03+2] netbox ganeti sync: Fix path to logfiles. [puppet] - 10https://gerrit.wikimedia.org/r/499288 (owner: 10CRusnov) [15:40:08] (03CR) 10EBernhardson: Elasticsearch: make unfreezing writes more robust. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [15:43:47] (03CR) 10Dzahn: "the URL parameters are fine now. 
now just: Class[Tilerator::Ui]: parameter 'sources_to_invalidate' expects a String value, got Tuple" [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [15:44:27] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/15447/maps1001.eqiad.wmnet/change.maps1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [15:48:43] !log bump ulsfo-codfw ospf link cost to 1000 - T219591 [15:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:48] T219591: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 [15:50:21] (03CR) 10Dzahn: "tested to stop and start with systemctl on both integration-slave-jessie-1001.integration and jenkins-slave-01.git , no issues" [puppet] - 10https://gerrit.wikimedia.org/r/499993 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [15:50:59] (03PS3) 10Jcrespo: transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) [15:51:23] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [15:51:41] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/500031 (https://phabricator.wikimedia.org/T219591) (owner: 10Ema) [15:51:54] !log repool ulsfo - T219591 [15:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:42] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (10ayounsi) a:03ayounsi [15:56:33] RECOVERY - puppet last run on 
elastic1047 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:57:41] (03PS2) 10Jbond: jessie-backports: remove redundant pins [puppet] - 10https://gerrit.wikimedia.org/r/499808 (https://phabricator.wikimedia.org/T219333) [15:57:43] (03PS1) 10Jbond: jessie-backports: remove updates from jessie bootstrap-vz config [puppet] - 10https://gerrit.wikimedia.org/r/500069 (https://phabricator.wikimedia.org/T219580) [16:00:00] (03PS2) 10Gehel: Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) [16:00:08] (03CR) 10Gehel: Elasticsearch: make unfreezing writes more robust. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [16:00:59] (03CR) 10EBernhardson: [C: 03+1] Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [16:01:25] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:01:45] (03PS2) 10Muehlenhoff: Pull in kibana/logstash 5.6.15 [puppet] - 10https://gerrit.wikimedia.org/r/500066 [16:02:43] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:07:22] (03PS1) 10EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [16:07:31] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.25 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:14:17] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:15:35] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:19:29] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:20:45] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:21:42] looks like the "Varnish traffic drop" alert was just ulsfo being repooled? [16:21:54] seems reasonable explanation [16:21:57] are the BFD status alerts of any concern XioNoX? [16:23:03] cdanis: "Varnish traffic drop" alert was just ulsfo being repooled? <- correct [16:23:52] about the BFD, not a concern, that link is now the backup one [16:26:45] (03PS1) 10Nuria: Removing TestSearchSatisfaction from it being persisted to MySQL [puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) [16:28:29] (03CR) 10Nuria: "Let's wait for @bearloga to be done with moving dashboards to run on top of hadoop to execute this." [puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) (owner: 10Nuria) [16:32:59] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 72.26 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:33:36] brb [16:39:55] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:41:11] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:44:36] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Contrary to what one would expect from the documentation, jessie-updates gets added anyways. 
See https://github.com/andsens/bootstrap-vz/b" [puppet] - 10https://gerrit.wikimedia.org/r/500069 (https://phabricator.wikimedia.org/T219580) (owner: 10Jbond) [16:55:15] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:56:31] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:02] (03CR) 10Smalyshev: Disable wbcs dispatching query builder on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [17:02:25] (03PS1) 10Cparle: Add 'depicts' statements to search index on testcommons and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500080 [17:03:07] (03PS2) 10Cparle: Add 'depicts' statements to search index on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500080 [17:08:48] (03CR) 10Gehel: "I think this looks reasonable. I'll do a last pass and merge this next Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [17:09:33] 10Operations, 10serviceops: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [17:09:46] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500066 (owner: 10Muehlenhoff) [17:10:07] (03PS3) 10Cparle: Add 'depicts' statements to search index on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500080 [17:11:30] (03CR) 10Eric Gardner: [C: 03+1] Add 'depicts' statements to search index on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500080 (owner: 10Cparle) [17:12:54] 10Operations, 10serviceops, 10Services (watching): Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Pchelolo) [17:23:42] 10Operations, 10serviceops, 10Services (watching): Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [17:29:48] 10Operations, 10serviceops, 10Services (watching), 10User-jijiki: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [17:31:27] 10Operations, 10serviceops, 10Services (watching), 10User-jijiki: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Pchelolo) If I understand correctly, in order to switch a particular job execution to PHP7 all we need to do is to add `Cookie: PHP_ENGINE=php7` header to the requ... 
[17:41:52] (03PS1) 10Bstorm: osmdb: set the CNAME for osmdb to the new instance in Cloud VPS [dns] - 10https://gerrit.wikimedia.org/r/500086 (https://phabricator.wikimedia.org/T219652) [17:44:59] (03PS10) 10MSantos: Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) [17:48:34] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10Bstorm) Thanks @Marostegui ! [17:55:26] (03CR) 10Herron: [C: 03+1] Pull in kibana/logstash 5.6.15 [puppet] - 10https://gerrit.wikimedia.org/r/500066 (owner: 10Muehlenhoff) [17:58:03] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:59:17] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:02:08] I downtimed the BFD alerts [18:13:51] (03CR) 10EBernhardson: Disable wbcs dispatching query builder on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [18:17:11] (03PS1) 10Bstorm: labsdb: remove old and likely unused cname for labsdb1004 [dns] - 10https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749) [18:20:44] (03CR) 10EBernhardson: Disable wbcs dispatching query builder on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [18:20:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:17] 
RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:39:59] (03CR) 10Marostegui: "I don't know if it is in use or not, but a good way to test it is to stop mysql and postresql for a few days, if no one complains those ho" [dns] - 10https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749) (owner: 10Bstorm) [18:44:52] (03PS2) 10EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [18:45:01] (03PS1) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [18:46:12] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [18:49:06] (03PS2) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [18:50:03] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) @jcrespo the problem was that ge-4/0/3 was already part of private1-b-codfw and not xe-4/0/3 so the install is in progress. will... 
[18:50:11] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [18:53:17] (03PS3) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [18:54:18] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [18:56:16] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) a:05Papaul→03jcrespo @jcrespo all yours let me know if you have any questions [19:01:55] any idea why this is not going through after the +2? I can't find it in here (https://integration.wikimedia.org/zuul/) either [19:04:01] (03PS1) 10Cwhite: profile: do not mutate level for mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) [19:09:26] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: 10Gehel) [19:10:16] 10Operations, 10Release Pipeline, 10Core Platform Team Kanban (Done with CPT), 10Release-Engineering-Team (Watching / External), 10Services (done): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10mobrovac) 05Open→03Resolved a:... [19:16:12] (03CR) 10Volans: [C: 03+1] "LGTM, a couple of question inline, feel free to merge as needed." 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [19:31:01] 10Operations, 10Cassandra, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10mobrovac) [19:35:13] (03PS4) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [19:36:24] 10Operations, 10Analytics, 10EventBus, 10vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10mobrovac) [19:42:01] (03PS1) 10Mholloway: Add cron job to update WikimediaEditorTasks suggestions table [puppet] - 10https://gerrit.wikimedia.org/r/500104 (https://phabricator.wikimedia.org/T218136) [19:43:06] (03CR) 10jerkins-bot: [V: 04-1] Add cron job to update WikimediaEditorTasks suggestions table [puppet] - 10https://gerrit.wikimedia.org/r/500104 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [19:44:11] (03PS2) 10Mholloway: Add cron job to update WikimediaEditorTasks suggestions table [puppet] - 10https://gerrit.wikimedia.org/r/500104 (https://phabricator.wikimedia.org/T218136) [19:53:14] (03CR) 10Smalyshev: Disable wbcs dispatching query builder on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [19:54:58] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Services (watching), 10User-jijiki: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10mobrovac) [20:07:19] (03CR) 10BryanDavis: "PCC output for icinga[12]001.wikimedia.org: https://puppet-compiler.wmflabs.org/compiler1002/15456/" [puppet] - 10https://gerrit.wikimedia.org/r/500095 
(https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [20:09:08] (03CR) 10EBernhardson: [C: 03+1] Elasticsearch: make unfreezing writes more robust. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [20:20:52] (03PS1) 10Bstorm: labsdb: decommissioning labsdb1004/5 [puppet] - 10https://gerrit.wikimedia.org/r/500117 (https://phabricator.wikimedia.org/T216749) [20:23:05] (03PS2) 10Bstorm: labsdb: decommissioning labsdb1004/5 [puppet] - 10https://gerrit.wikimedia.org/r/500117 (https://phabricator.wikimedia.org/T216749) [20:25:34] (03CR) 10Bstorm: [C: 03+2] labsdb: decommissioning labsdb1004/5 [puppet] - 10https://gerrit.wikimedia.org/r/500117 (https://phabricator.wikimedia.org/T216749) (owner: 10Bstorm) [20:29:38] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers [20:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:08] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers (duration: 00m 30s) [20:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:31:47] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers (part 2) [20:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:35:17] !log 
ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers (part 2) (duration: 03m 30s) [20:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:09] 10Operations, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10TheDJ) FYI, I have configured [abc].tiles.wmflabs.org webhosts to redirect to http://tiles.wmflabs.org during {T204506}... [20:46:12] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:46:44] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers (part 3) [20:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:02] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:49:58] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers (part 3) (duration: 03m 13s) [20:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:01] (03PS4) 10CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 [20:55:56] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers (part 3) [20:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:12] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:01:10] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@ff9d424]: Elasticsearch 6 fixes for content-type headers (part 3) (duration: 05m 14s) [21:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:31] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:07:08] PROBLEM - HHVM rendering on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:07:11] (03PS5) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [21:08:22] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 78663 bytes in 0.363 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:14] PROBLEM - puppet last run on kafkamon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:54] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:20:48] PROBLEM - puppet last run on kafka-jumbo1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:23:14] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:29:42] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:30:15] (03PS6) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [21:40:30] RECOVERY - puppet last run on kafkamon1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:45:48] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:47:04] RECOVERY - puppet last run on kafka-jumbo1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:06:39] !log stopped database services on labsdb1004 and labsdb1005 [22:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:24] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10Bstorm) Database services (postgres and mariadb) are now shut... 
[22:09:19] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10Bstorm) [22:15:04] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:24:00] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:28:04] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:35:06] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:40:43] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Bstorm) [22:45:34] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:46:34] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:46:50] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 4 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:48:34] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:49:50] RECOVERY - Router interfaces 
on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:00:22] (03CR) 10BryanDavis: [C: 04-1] wmcs: Migrate tools-checker to Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [23:15:20] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:25:48] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:36:44] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:42:02] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:44:36] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:51:10] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:56:10] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:57:26] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:58:44] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status