[00:02:15] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:02:21] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:12:33] (CR) BryanDavis: [C: -1] "Need to address issues from Brooke's review" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: BryanDavis)
[00:13:26] (PS7) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[00:17:02] (PS8) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[00:20:57] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:25:58] Operations, ops-codfw, Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (Papaul) @Volans the server is running on old firmware. HPE Smart Storage Battery 1 Firmware 1.1 Embedded iLO 2.40 Dec 02 2015 System Board Intelligent Platform Abstraction Data 20.3 System...
[00:35:55] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:41:45] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:42:09] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:47:43] (PS12) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[00:49:45] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:53:37] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:54:41] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:08:51] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:11:13] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[01:15:13] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:15:17] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:20:33] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:20:41] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 934 days) https://wikitech.wikimedia.org/wiki/Logs
[01:24:15] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:29:29] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:38:59] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:42:53] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:52:11] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:56:37] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:59:27] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:59:37] (PS1) Andrew Bogott: pdns-recursor: reduce maximum number of file descriptors [puppet] - https://gerrit.wikimedia.org/r/500880 (https://phabricator.wikimedia.org/T219953)
[02:02:02] (CR) Andrew Bogott: [C: +2] pdns-recursor: reduce maximum number of file descriptors [puppet] - https://gerrit.wikimedia.org/r/500880 (https://phabricator.wikimedia.org/T219953) (owner: Andrew Bogott)
[02:02:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:04:55] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:26:13] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:26:29] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[02:36:31] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21701744 and 0 seconds
[02:37:49] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1688 and 2 seconds
[02:39:11] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:42:09] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:43:19] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 75009 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:10:33] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:29:51] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:36:59] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:37:37] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:37:55] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:56:57] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[03:59:33] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:15:29] (PS25) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[04:16:09] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:18:30] (CR) BryanDavis: wmcs: Migrate tools-checker to Stretch (3 comments) [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: BryanDavis)
[04:22:39] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:22:51] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:26:37] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:30:29] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:32:27] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:33:03] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:33:17] Operations, ORES, Scoring-platform-team (Research): Investigate memory usage of ORES in kubernetes - https://phabricator.wikimedia.org/T210264 (Harej)
[04:35:03] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:35:41] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:38:15] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:38:57] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:39:45] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:44:23] (PS9) CRusnov: Netbox module for Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[04:47:47] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:49:15] (CR) CRusnov: "Woowee lots of changes." (12 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: CRusnov)
[04:49:29] (CR) Mobrovac: "> Lemme know when it's ready to go (what's blocking it?) and I 'll" [puppet] - https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: Mobrovac)
[04:49:47] (CR) jerkins-bot: [V: -1] Netbox module for Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: CRusnov)
[04:54:46] (PS10) CRusnov: Netbox module for Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[04:56:25] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:59:02] (CR) jerkins-bot: [V: -1] Netbox module for Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: CRusnov)
[05:01:02] (PS11) CRusnov: Netbox module for Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[05:06:10] (CR) jerkins-bot: [V: -1] Netbox module for Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: CRusnov)
[05:11:48] (PS12) CRusnov: Netbox module for Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[05:13:11] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:14:11] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:19:10] (PS1) Marostegui: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/500892
[05:21:23] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/500892 (owner: Marostegui)
[05:22:17] (Merged) jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/500892 (owner: Marostegui)
[05:23:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1007 for upgrade (duration: 01m 00s)
[05:23:44] !log Upgrade pc1007
[05:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:25:55] uh?
[05:26:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:26:15] (CR) jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/500892 (owner: Marostegui)
[05:26:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:26:45] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:26:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:27:06] what's going on?
[05:28:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/500893
[05:28:57] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:29:05] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:29:08] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/500893 (owner: Marostegui)
[05:29:21] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:29:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:29:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:08] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/500893 (owner: Marostegui)
[05:30:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:31:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1007 (duration: 00m 59s)
[05:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:55] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:34:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:34:17] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:34:33] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:35:51] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:37:11] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:37:22] (CR) jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/500893 (owner: Marostegui)
[05:38:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:40:05] (PS1) Marostegui: mariadb: Move wikimedia_editor_tasks_entity_description_exists [puppet] - https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302)
[05:43:01] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:43:11] (PS2) Marostegui: mariadb: Move wikimedia_editor_tasks_entity_description_exists [puppet] - https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302)
[05:43:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:43:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:43:55] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:44:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:44:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:46:15] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:46:15] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:46:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:46:31] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:46:41] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:46:57] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:47:07] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:47:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:52:25] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:54:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:54:45] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:54:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:55:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:55:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:55:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:57:29] !log Fix data drifts on bnwikisource on x1 - T219493
[05:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:33] T219493: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493
[05:57:35] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:57:42] <_joe_> marostegui: can we stop deploying things?
[05:57:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:57:56] yep, I am not deploying anything for a while
[05:58:37] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:58:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:59:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:59:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:59:49] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:03:53] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:04:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[06:04:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[06:04:26] <_joe_> !log restart varnish backend on cp1085, causing unavailability
[06:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:04:47] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[06:17:02] Krinkle, AaronSchulz - o/ is https://gerrit.wikimedia.org/r/499693 going to be deployed this week by any chance? I'd really like to see it working asap, it might reduce a lot the number of TKOs that we are seeing :)
[06:23:03] (CR) Elukey: [C: +1] uwsgi: allow setting routing rules [puppet] - https://gerrit.wikimedia.org/r/500729 (owner: Giuseppe Lavagetto)
[06:23:17] (CR) Elukey: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/15514/graphite1004.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/500730 (owner: Giuseppe Lavagetto)
[06:25:53] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:26:57] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[06:31:05] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:37:37] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:38:07] Operations, serviceops, Core Platform Team Backlog (Watching / External), Services (watching), User-jijiki: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) p:Triage→Normal
[06:53:05] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:00:06] (CR) Jcrespo: "Filtered tables will attempt to create a trigger for each column- that will be a waste of resources. I would prefer to leave it as private" [puppet] - https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) (owner: Marostegui)
[07:01:25] (CR) Marostegui: "> Filtered tables will attempt to create a trigger for each column-" [puppet] - https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) (owner: Marostegui)
[07:03:46] (PS1) Marostegui: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500895 (https://phabricator.wikimedia.org/T219493)
[07:03:51] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:04:56] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500895 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[07:06:02] (Merged) jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500895 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[07:06:15] (CR) jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500895 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[07:07:17] (CR) Jcrespo: "Ah, so you want to make it public? Ok to me then, but it will have to be imported." [puppet] - https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) (owner: Marostegui)
[07:07:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1120 T219493 (duration: 01m 13s)
[07:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:37] T219493: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493
[07:07:38] (CR) Marostegui: "> Ah, so you want to make it public? Ok to me then, but it will have" [puppet] - https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) (owner: Marostegui)
[07:09:22] !log Stop replication in sync on db1120 and db2034 (x1 codfw master) - T219493
[07:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:36] (CR) Gilles: "Nothing blocking this, should be fine" [puppet] - https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: Mobrovac)
[07:20:57] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500897
[07:22:59] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:23:16] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500897 (owner: Marostegui)
[07:24:04] Operations, Domains, Traffic: figure out if we can park wicipediacymraeg.org - https://phabricator.wikimedia.org/T128085 (Dzahn)
[07:24:08] Operations, Domains, Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (Dzahn)
[07:24:47] Operations, Domains, Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (Dzahn) also see T128085
[07:24:55] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500897 (owner: Marostegui)
[07:25:47] !log Deploy schema change on db1073, labtestwiki - T219887
[07:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:50] T219887: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887
[07:26:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1120 T219493 (duration: 00m 57s)
[07:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:12] T219493: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493
[07:28:11] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500897 (owner: Marostegui)
[07:29:55] Operations, Domains, Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (Vgutierrez) regarding TLS wicipediacymraeg.org should benefit from T133548 that should be implemented during Q4
[07:30:03] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:30:30] (CR) Marostegui: "jcrespo so you ok with this to go?"
[puppet] - 10https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) (owner: 10Marostegui) [07:32:01] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:32:07] (03CR) 10Jcrespo: [C: 03+1] mariadb: Move wikimedia_editor_tasks_entity_description_exists [puppet] - 10https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) (owner: 10Marostegui) [07:32:14] \o/ [07:36:01] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:38:37] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:40:45] !log DIsable event scheduler on db1115 before restarting - tendril is stuck [07:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:14] !log Reboot db1115 - tendril and dbtree will be down [07:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:19] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:42:53] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) I think that we should move away from hacks done up to now and... 
[07:43:49] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Depool s8 sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500899 (https://phabricator.wikimedia.org/T218302) [07:44:09] PROBLEM - HTTP-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [07:44:43] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Depool s8 sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500899 (https://phabricator.wikimedia.org/T218302) [07:44:43] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [07:44:57] ^ expected as !logged before [07:45:39] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10MoritzMuehlenhoff) >>! In T148843#5080853, @elukey wrote: > * see if i... 
[07:45:44] (03CR) 10Gilles: Make caching of static performance site explicit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [07:46:15] (03CR) 10Muehlenhoff: [C: 03+1] gerrit: admins: ops -> gerritadmin [puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar) [07:47:53] RECOVERY - HTTP-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 80550 bytes in 0.325 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [07:48:29] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 80592 bytes in 0.834 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [07:51:12] !log installing new apache packages on mwdebug [07:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:26] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.24/includes/media/ThumbnailImage.php: T216499 Only apply high priority hint half the time (duration: 00m 58s) [07:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:29] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499 [07:55:49] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [08:02:10] (03PS4) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) [08:02:47] (03CR) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [08:07:34] (03PS3) 10Dzahn: gerrit: admins: ops -> gerritadmin [puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar) [08:08:32] (03CR) 10Dzahn: [C: 03+2] gerrit: admins: ops -> gerritadmin [puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar) [08:09:31] 
(03PS1) 10Jcrespo: network constants: dbmonitor hosts are not general monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/500900 [08:09:52] !log installing new apache packages on mmw1261 [08:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:39] (03CR) 10Marostegui: [C: 03+1] network constants: dbmonitor hosts are not general monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/500900 (owner: 10Jcrespo) [08:12:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500900 (owner: 10Jcrespo) [08:13:15] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:13:46] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Depool s8 sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500899 (https://phabricator.wikimedia.org/T218302) (owner: 10Marostegui) [08:14:54] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Depool s8 sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500899 (https://phabricator.wikimedia.org/T218302) (owner: 10Marostegui) [08:15:10] (03PS2) 10Jcrespo: network constants: dbmonitor hosts are not general monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/500900 [08:15:32] 10Operations, 10Traffic, 10Goal: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ema) [08:15:39] 10Operations, 10Traffic, 10Goal: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ema) p:05Triage→03Normal [08:15:47] (03PS3) 10Jcrespo: network constants: dbmonitor hosts are not general monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/500900 [08:16:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool s8 sanitarium master (duration: 00m 58s) [08:16:23] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:15] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:17:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool s8 sanitarium master (duration: 00m 57s) [08:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:35] !log Stop replication on db2082 and db1087 (s8 sanitarium masters) T218302 [08:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:38] T218302: Choose DB/Cluster for WikimediaEditorTasks tables - https://phabricator.wikimedia.org/T218302 [08:19:27] (03PS3) 10Marostegui: mariadb: Move wikimedia_editor_tasks_entity_description_exists [puppet] - 10https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) [08:20:31] (03CR) 10Marostegui: [C: 03+2] mariadb: Move wikimedia_editor_tasks_entity_description_exists [puppet] - 10https://gerrit.wikimedia.org/r/500894 (https://phabricator.wikimedia.org/T218302) (owner: 10Marostegui) [08:22:57] (03PS4) 10Jcrespo: network constants: dbmonitor hosts are not general monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/500900 [08:23:02] !log Restart mysql on sanitarium hosts db1124 db1125 db2094 db2095 - T218302 [08:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:01] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Depool s8 sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500899 (https://phabricator.wikimedia.org/T218302) (owner: 10Marostegui) [08:27:35] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:33:53] (03CR) 10Jcrespo: [C: 03+2] network constants: dbmonitor hosts are not general monitoring 
hosts [puppet] - 10https://gerrit.wikimedia.org/r/500900 (owner: 10Jcrespo) [08:35:27] !log merging change on network constants (firewall operation) [08:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:09] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) @EBernhardson I think that the most pressing point now is to d... [08:38:50] ^marostegui be vigilant for any network issue, even if it should be a noop [08:38:56] wilco [08:38:58] thanks [08:38:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:39:09] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:39:25] I will check db1115, maybe now there is a port opening needed? [08:40:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Some comments inline which are mostly optional stuff." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [08:40:31] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:42:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Let me know if you want me to merge this or if andrew will handle it."
[puppet] - 10https://gerrit.wikimedia.org/r/500825 (owner: 10Alex Monk) [08:42:22] marostegui: we need a new rule for db1115, I added one manually for now [08:42:40] I just saw the dbmonitor complaining about timeout on icinga [08:43:03] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [08:43:12] hehe that ^ [08:43:21] it should be back [08:43:27] oh, I see, it is the other one [08:43:41] that should do it [08:43:51] 1001 is gone on icinga [08:43:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "let me know if andrew doesn't merge this, I can do it." [puppet] - 10https://gerrit.wikimedia.org/r/500824 (owner: 10Alex Monk) [08:43:58] and dbtree works for me [08:44:09] dbmonitor2001 is passive anyway [08:44:13] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 80589 bytes in 0.946 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [08:44:15] yep [08:44:21] that is why I didn't add it at first [08:44:25] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:44:32] so we need the glue between frontends and backends on a rule [08:44:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500823 (owner: 10Alex Monk) [08:44:45] soo.. i can explain the 2001 thing [08:44:55] if hiera('do_acme', true) { [08:44:55] ferm::service { 'tendril-http-https': [08:45:04] only opens port 80/443 if do_acme [08:45:05] ? 
[08:45:08] so only on the active one [08:45:24] mutante: it is actually the backend- it is ok now [08:45:35] oh, nevermind then :) [08:46:41] I am not worried about that, I am more worried about if others are ok [08:46:48] eg "ERROR ferm input drop default policy not set, ferm might not have been started correctly" [08:47:17] (03CR) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [08:48:26] (03PS3) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [08:48:49] jynus: gotcha, but even if ferm fails to restart for some reason, like failed DNS lookup, it doesn't mean that everything is now closed or open (anymore). had a case like that not too long ago [08:49:38] i don't know what was mw1338 thing, it seems ok now [08:53:31] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:54:09] (03PS4) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) [08:56:08] (03PS4) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [08:57:09] (03CR) 10jerkins-bot: [V: 04-1] openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [09:03:56] (03PS5) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [09:05:05] (03CR) 10jerkins-bot: [V: 04-1] openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [09:08:12] 
(03PS1) 10Jcrespo: tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 [09:08:51] (03PS3) 10Muehlenhoff: haproxy: Remove Ubuntu support [puppet] - 10https://gerrit.wikimedia.org/r/487895 [09:09:03] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:09:35] (03CR) 10jerkins-bot: [V: 04-1] tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:10:03] (03CR) 10Muehlenhoff: [C: 03+2] haproxy: Remove Ubuntu support [puppet] - 10https://gerrit.wikimedia.org/r/487895 (owner: 10Muehlenhoff) [09:10:13] (03PS5) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) [09:11:58] (03PS6) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) [09:13:14] (03CR) 10Ema: [C: 03+2] ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [09:15:52] (03CR) 10Muehlenhoff: tendril: Open firewall only from tendril web to tendril db backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:17:18] (03PS1) 10Marostegui: Revert "db-eqiad,db-codfw.php: Depool s8 sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500906 [09:17:32] jouncebot: next [09:17:32] In 1 hour(s) and 42 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1100) [09:18:47] (03PS2) 10Jcrespo: tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 [09:23:44] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad,db-codfw.php: Depool s8 sanitarium 
masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500906 (owner: 10Marostegui) [09:24:51] (03Merged) 10jenkins-bot: Revert "db-eqiad,db-codfw.php: Depool s8 sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500906 (owner: 10Marostegui) [09:25:48] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.21 [software/spicerack] - 10https://gerrit.wikimedia.org/r/500907 [09:26:04] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool s8 sanitarium master (duration: 01m 00s) [09:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:34] (03PS3) 10Jcrespo: tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 [09:27:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool s8 sanitarium master (duration: 00m 56s) [09:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:31] !log Drop wikishared.wikimedia_editor_tasks_entity_description_exists table from x1 T219963 [09:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:38] T219963: Drop wikishared.wikimedia_editor_tasks_entity_description_exists table from x1 - https://phabricator.wikimedia.org/T219963 [09:27:39] (03CR) 10Jcrespo: "Please give it a new look :-)" [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:29:22] !log cp-ats-codfw: test ATS rolling restart T213263 [09:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:26] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [09:29:33] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:30:26] (03CR) 10jenkins-bot: Revert "db-eqiad,db-codfw.php: Depool s8 sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500906 (owner: 10Marostegui) [09:30:47] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/15517/" [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:33:04] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: introduce openstack/debian split layout [puppet] - 10https://gerrit.wikimedia.org/r/500908 [09:33:12] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.21 [software/spicerack] - 10https://gerrit.wikimedia.org/r/500907 (owner: 10Volans) [09:33:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:36:07] (03PS1) 10Gehel: elasticsearch: only expose HTTP ports where needed. [puppet] - 10https://gerrit.wikimedia.org/r/500909 [09:36:17] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: introduce openstack/debian split layout [puppet] - 10https://gerrit.wikimedia.org/r/500908 [09:37:41] (03CR) 10Jcrespo: "I am deploying this without much delay as this is technically broken right now and the scope of potential breakage is very small." [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:37:50] (03PS4) 10Jcrespo: tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 [09:38:01] (03PS2) 10Gehel: elasticsearch: only expose HTTP ports where needed. 
[puppet] - 10https://gerrit.wikimedia.org/r/500909 [09:38:06] (03CR) 10Marostegui: [C: 03+1] tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:38:41] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.21 [software/spicerack] - 10https://gerrit.wikimedia.org/r/500907 (owner: 10Volans) [09:38:48] (03CR) 10Gehel: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/15519/" [puppet] - 10https://gerrit.wikimedia.org/r/500909 (owner: 10Gehel) [09:38:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC ok https://puppet-compiler.wmflabs.org/compiler1002/15518/" [puppet] - 10https://gerrit.wikimedia.org/r/500908 (owner: 10Arturo Borrero Gonzalez) [09:39:54] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.21 [software/spicerack] - 10https://gerrit.wikimedia.org/r/500907 (owner: 10Volans) [09:43:19] (03PS1) 10Ladsgroup: Enable UrlShortener in mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) [09:44:27] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/15520/cloudelastic1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/500909 (owner: 10Gehel) [09:44:46] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: only expose HTTP ports where needed. 
[puppet] - 10https://gerrit.wikimedia.org/r/500909 (owner: 10Gehel) [09:45:36] !log removed labtestnet2003.codfw.wmnet from debmonitor (T219776) [09:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:39] T219776: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 [09:46:38] (03PS1) 10Volans: Upstream release v0.0.21 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/500912 [09:47:03] (03PS1) 10Dzahn: confd: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/500913 [09:47:17] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:47:41] (03CR) 10Dzahn: [C: 03+1] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/500913" [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [09:47:54] (03CR) 10Gehel: [C: 03+2] elasticsearch: only expose HTTP ports where needed. [puppet] - 10https://gerrit.wikimedia.org/r/500909 (owner: 10Gehel) [09:48:02] (03PS3) 10Gehel: elasticsearch: only expose HTTP ports where needed. [puppet] - 10https://gerrit.wikimedia.org/r/500909 [09:48:06] (03PS5) 10Jcrespo: tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 [09:49:27] (03CR) 10Jcrespo: [C: 03+2] tendril: Open firewall only from tendril web to tendril db backend [puppet] - 10https://gerrit.wikimedia.org/r/500904 (owner: 10Jcrespo) [09:49:57] (03PS4) 10Gehel: elasticsearch: only expose HTTP ports where needed. [puppet] - 10https://gerrit.wikimedia.org/r/500909 [09:51:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, none of the servers matched by "cumin C:confd" are running trusty." [puppet] - 10https://gerrit.wikimedia.org/r/500913 (owner: 10Dzahn) [09:52:54] (03CR) 10Dzahn: "thanks.
after this also https://gerrit.wikimedia.org/r/c/operations/puppet/+/456317" [puppet] - 10https://gerrit.wikimedia.org/r/500913 (owner: 10Dzahn) [09:53:38] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.21 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/500912 (owner: 10Volans) [09:54:09] !log running mysql select queries on m3-slave to get data from phabricator conpherence as requested by andre [09:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:26] !log upgrading beta to hhvm wikidiff 1.8.1 (T203069) [09:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:30] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [09:56:19] (03PS4) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 [09:56:30] !log Alter empty job table on s6 primary master - T219887 [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:34] T219887: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887 [09:58:17] (03PS1) 10Gehel: elasticsearch: restrict access to cloudelastic to only cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/500914 [09:59:28] (03PS2) 10Gehel: elasticsearch: restrict access to cloudelastic to only cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/500914 [09:59:43] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: restrict access to cloudelastic to only cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/500914 (owner: 10Gehel) [10:00:30] (03CR) 10Gehel: [C: 03+2] elasticsearch: restrict access to cloudelastic to only cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/500914 (owner: 10Gehel) [10:03:25] (03Merged) 10jenkins-bot: Upstream release v0.0.21 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/500912 (owner: 10Volans) [10:07:42] Getting database query errors 
[10:07:45] Import failed: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? Query: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES ('108','List_of_Australian_AM_radio_stations','','0','1','0.896837776405','20190403100623','0','0') Function: WikiPage::insertOn Error: [10:07:45] 1205 Lock wait timeout exceeded; try restarting transaction (10.64.32.136) [10:08:02] where is that? [10:08:14] is it happening all the time or just once? [10:08:45] Just happened twice now while trying to import [10:08:49] so far I have only seen that same error [10:09:18] Leaderboard: that is the same as: https://phabricator.wikimedia.org/T219702 I think [10:09:51] Looks like that, will add my case to it. Thanks [10:09:54] thanks [10:10:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: drop serverpackages profile [puppet] - 10https://gerrit.wikimedia.org/r/500916 [10:14:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc happy (mostly NOOP) https://puppet-compiler.wmflabs.org/compiler1002/15521/" [puppet] - 10https://gerrit.wikimedia.org/r/500916 (owner: 10Arturo Borrero Gonzalez) [10:15:50] (03PS5) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 [10:17:37] !log uploaded spicerack_0.0.21-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [10:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:50] (03CR) 10GTirloni: [C: 03+1] Add python 3.5 and nodejs 10 types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496265 (owner: 10BryanDavis) [10:18:18] (03PS2) 10Dzahn: confd: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/500913 [10:18:41] (03PS6) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [10:19:54] !log upgraded 
spicerack to 0.0.21 on cumin[12]001 [10:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:03] gehel, onimisionipe FYI ^^^ ;) [10:22:51] nice! [10:23:40] (03PS7) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [10:25:36] !log updating puppet compiler facts [10:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:45] (03PS1) 10Elukey: profile::aqs: add the analytics contact group to aqs's alarm [puppet] - 10https://gerrit.wikimedia.org/r/500917 [10:27:30] (03CR) 10Jbond: [C: 03+2] jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 (owner: 10Jbond) [10:27:39] (03PS6) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 [10:27:48] !log planet1001/2001 - upgrade apache2, openssh, locales, rsyslog .. [10:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:28] !log planet1001/2001 - apt autoremove un-required packages [10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Aside from the comment below, I also recommend against deploying this feature until T219974 is resolved." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [10:31:21] (03PS2) 10Elukey: profile::aqs: add the analytics contact group to aqs's alarms [puppet] - 10https://gerrit.wikimedia.org/r/500917 [10:31:53] (03PS8) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [10:32:51] (03CR) 10Dzahn: profile::aqs: add the analytics contact group to aqs's alarms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500917 (owner: 10Elukey) [10:36:29] PROBLEM - puppet last run on cp5006 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 3 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/home/jbond/.gitconfig],File[/home/jbond/.vim/autoload/pathogen.vim],File[/home/jbond/.vim/bundle/README.md],File[/home/jbond/.vimrc] [10:36:54] thanks mutante ! [10:36:59] 10Operations: Add support for temporary chroots to boron - https://phabricator.wikimedia.org/T219977 (10ema) [10:37:05] 10Operations: Add support for temporary chroots to boron - https://phabricator.wikimedia.org/T219977 (10ema) p:05Triage→03Normal [10:37:13] elukey: :) yw, enjoy lunch [10:37:14] (03PS3) 10Elukey: profile::aqs: add the analytics contact group to aqs's alarms [puppet] - 10https://gerrit.wikimedia.org/r/500917 [10:37:29] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/home/jbond/.gitconfig],File[/home/jbond/.vim/autoload/pathogen.vim],File[/home/jbond/.vim/bundle/README.md],File[/home/jbond/.vimrc] [10:38:27] i bet that's just a race and phab2001 is random [10:38:34] since home dir files change it on everything [10:39:03] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/home/jbond/.vim/bundle/README.md],File[/home/jbond/.zshrc] [10:39:18] yea, confirmed on phab2001.. no issue [10:39:27] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 7 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/home/jbond/.vim/autoload/pathogen.vim],File[/home/jbond/.vim/bundle/README.md],File[/home/jbond/.vimrc],File[/home/jbond/.zshenv] [10:39:45] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/home/jbond/.vim/bundle/README.md],File[/home/jbond/.vimrc] [10:39:45] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/home/jbond/.gitconfig],File[/home/jbond/.vimrc],File[/home/jbond/.zshrc] [10:40:43] running puppet on all of that [10:41:16] (03PS2) 10Ladsgroup: Enable UrlShortener in mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) [10:41:32] sorry reverting [10:41:52] (03PS1) 10Jbond: Revert "jbond home: add user files" [puppet] - 10https://gerrit.wikimedia.org/r/500919 [10:42:03] (03CR) 10jerkins-bot: [V: 04-1] Enable UrlShortener in mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [10:42:47] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:42:49] jbond42: why revert. 
it's fine [10:42:59] nothing to worry [10:43:09] it's just a matter of scale and timing [10:43:21] oh ok i just saw problems and my name [10:43:36] a change that touches everything just gets like 5 out of 1000 [10:43:43] that happen to run during the check or so [10:43:50] they were all fine after next run [10:43:57] ahh ok thanks [10:44:04] no worries [10:44:19] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:44:21] (03Abandoned) 10Jbond: Revert "jbond home: add user files" [puppet] - 10https://gerrit.wikimedia.org/r/500919 (owner: 10Jbond) [10:44:41] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:45:01] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:45:01] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:46:24] (03PS9) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [10:49:07] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:49:41] (03PS3) 10Ladsgroup: Enable UrlShortener in mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) [10:50:26] (03PS10) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [10:53:31] 10Operations, 10Phabricator, 10Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10ema) [10:53:39] 10Operations, 10Phabricator, 10Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10ema) p:05Triage→03Normal [10:54:07] (03CR) 10Arturo Borrero Gonzalez: 
[C: 03+2] "I'm finally happy with the resulting catalogs: https://puppet-compiler.wmflabs.org/compiler1002/15527/" [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [10:54:51] (03CR) 10Alex Monk: "I don't mind which person does it, whichever is most convenient for you two. I think Andrew is planning to when he has time." [puppet] - 10https://gerrit.wikimedia.org/r/500825 (owner: 10Alex Monk) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1100). [11:00:04] Tulsi and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] o/ [11:00:40] Amir1: could you please also deploy Tulsi's patch? 
(if they are around) [11:00:45] 10Operations: Add support for temporary chroots to boron - https://phabricator.wikimedia.org/T219977 (10ema) [11:00:56] yeah sure, let's see if they are around [11:01:30] Amir1: swat is yours then, start with your patch, continue with Tulsi's [11:01:42] yess [11:02:51] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:03:54] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [11:05:02] (03Merged) 10jenkins-bot: Enable UrlShortener in mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [11:06:48] XioNoX: they should be investigated, we can downtime for a few days though [11:07:53] (03PS1) 10Jbond: jbond_home: use emacs bindkeys [puppet] - 10https://gerrit.wikimedia.org/r/500923 [11:09:37] (03CR) 10Jbond: [C: 03+2] jbond_home: use emacs bindkeys [puppet] - 10https://gerrit.wikimedia.org/r/500923 (owner: 10Jbond) [11:09:48] (03PS1) 10Arturo Borrero Gonzalez: Revert "openstack: clientpackages: fix missing deb repo installation" [puppet] - 10https://gerrit.wikimedia.org/r/500924 [11:09:50] (03CR) 10jenkins-bot: Enable UrlShortener in mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [11:11:06] (03CR) 10jerkins-bot: [V: 04-1] Revert "openstack: clientpackages: fix missing deb repo installation" [puppet] - 10https://gerrit.wikimedia.org/r/500924 (owner: 10Arturo Borrero Gonzalez) [11:12:06] (03PS2) 10Arturo Borrero Gonzalez: Revert "openstack: clientpackages: fix missing deb repo installation" [puppet] - 10https://gerrit.wikimedia.org/r/500924 [11:13:39] (03PS3) 10Arturo Borrero Gonzalez: Revert "openstack: clientpackages: fix missing deb repo installation" [puppet] - 
10https://gerrit.wikimedia.org/r/500924 [11:14:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "openstack: clientpackages: fix missing deb repo installation" [puppet] - 10https://gerrit.wikimedia.org/r/500924 (owner: 10Arturo Borrero Gonzalez) [11:16:13] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:16:32] !rolling security updates for apache [11:16:36] !log rolling security updates for apache [11:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:30] okay, it works for example w.wiki/6 [11:20:30] (03PS1) 10Ladsgroup: Revert "Enable UrlShortener in mediawikiwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500926 [11:21:53] (03CR) 10Ladsgroup: [C: 03+2] Revert "Enable UrlShortener in mediawikiwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500926 (owner: 10Ladsgroup) [11:23:04] (03Merged) 10jenkins-bot: Revert "Enable UrlShortener in mediawikiwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500926 (owner: 10Ladsgroup) [11:25:20] !log EU SWAT is done [11:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:50] zeljkof: I put all of the Easter eggs. like w.wiki/e (ta-duh) [11:26:55] (03PS1) 10Volans: check_icinga: add configuration validator [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500927 [11:28:23] (03CR) 10Volans: "The plan is to add a symlink on the wikitech-static host like we have now for the check_icinga and tell people to run it when modifying th" [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500927 (owner: 10Volans) [11:30:17] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:30:21] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: connect to address 208.80.153.52 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [11:30:27] PROBLEM - Check systemd state on dbmonitor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:30:37] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Miriam) Thanks @EBernhardson and all!!. Would a CNN finetuning task, u... [11:30:39] PROBLEM - puppet last run on mc2031 is CRITICAL: CRITICAL: Puppet has 12 failures. Last run 3 minutes ago with 12 failures. Failed resources (up to 3 shown) [11:32:05] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:32:07] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Puppet has 80 failures. Last run 5 minutes ago with 80 failures. Failed resources (up to 3 shown) [11:32:18] (03CR) 10jenkins-bot: Revert "Enable UrlShortener in mediawikiwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500926 (owner: 10Ladsgroup) [11:32:27] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown) [11:32:31] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Puppet has 35 failures. Last run 5 minutes ago with 35 failures. Failed resources (up to 3 shown): File[/var/lib/hphpd/hphpd.ini],File[/usr/local/bin/mwrepl],File[/etc/logrotate.d/mediawiki_apache],File[/etc/rsyslog.lookup.d/lookup_table_output.json] [11:32:43] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Puppet has 37 failures. Last run 5 minutes ago with 37 failures. 
Failed resources (up to 3 shown): File[/usr/share/GeoIP],File[/etc/tmpreaper.conf],File[/usr/local/bin/cgroup-mediawiki-clean],File[/etc/ImageMagick-6/policy.xml] [11:32:45] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/home/krinkle],File[/home/gilles] [11:33:11] PROBLEM - puppet last run on an-worker1079 is CRITICAL: CRITICAL: Puppet has 40 failures. Last run 4 minutes ago with 40 failures. Failed resources (up to 3 shown): File[/home/filippo],File[/home/jgreen],File[/home/bblack],File[/home/andrew] [11:33:15] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 71 failures. Last run 6 minutes ago with 71 failures. Failed resources (up to 3 shown) [11:33:35] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 4 minutes ago with 11 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled],File[/usr/local/bin/prometheus-puppet-agent-stats],File[/etc/rsyslog.d] [11:34:27] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:34:35] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:34:44] Expected? ^ [11:35:11] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Puppet has 31 failures. Last run 6 minutes ago with 31 failures. Failed resources (up to 3 shown): File[/etc/vim/vimrc.local],File[/usr/local/bin/phaste],File[/root/.screenrc],File[/usr/local/lib/nagios/plugins/] [11:35:33] PROBLEM - puppet last run on cloudvirtan1004 is CRITICAL: CRITICAL: Puppet has 33 failures. Last run 6 minutes ago with 33 failures. 
Failed resources (up to 3 shown): File[/home/filippo],File[/home/jgreen],File[/home/bblack],File[/home/andrew] [11:35:45] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:35:55] RECOVERY - puppet last run on mc2031 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:36:03] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 7 minutes ago with 13 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_systemd_state],File[/usr/local/lib/nagios/plugins/check_long_procs],File[/etc/smartmontools/run.d/20logger],File[/usr/lib/nagios/plugins/check_timedatectl] [11:36:05] ^^ checking however i think this may have occurred as i updated apache without first disabling puppet agent [11:36:19] sorry for the noise, everything i have tested so far is working [11:36:39] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 7 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt],File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt],File[/usr/local/share/ca-certificates/DigiCert_SHA2_High_Assurance_Server_CA.crt],File[/usr/local/share/ca-certificates/GlobalSign_ [11:36:39] dation_CA_-_SHA256_-_G2.crt] [11:36:55] you mean upgrading apache on one of the puppet masters?
[11:37:00] jbond42: hey we know icinga works now at least lol [11:37:30] if so, yes, that is the typical amount of puppet spam caused by the apache upgrade window [11:37:49] moritzm: erm no i didn't think these would be so noisy :S [11:37:59] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:38:37] puppetdb restarts are even worse :-) [11:39:29] moritzm: to be fair icinga is bound to complain regardless what you do :P [11:40:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Let's err on the side of caution and wait for filippo's +1 as well" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [11:40:37] fyi the alert on webperf1002 is genuine https://phabricator.wikimedia.org/P8334 [11:41:01] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:41:17] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[apache2] [11:42:11] looking at webperf1002 [11:42:29] (03PS1) 10Ladsgroup: wikilabels: update alias for database [puppet] - 10https://gerrit.wikimedia.org/r/500928 (https://phabricator.wikimedia.org/T219563) [11:50:18] (03PS1) 10Muehlenhoff: xhgui: Properly include passwords [puppet] - 10https://gerrit.wikimedia.org/r/500930 [11:50:25] (03PS1) 10Jbond: webperf1002: include passwords class [puppet] - 10https://gerrit.wikimedia.org/r/500931 [11:50:36] lol moritzm seems you beat me to it [11:51:02] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500930 (owner: 10Muehlenhoff) [11:51:15] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures.
Failed resources (up to 3 shown): Service[apache2] [11:51:31] (03Abandoned) 10Jbond: webperf1002: include passwords class [puppet] - 10https://gerrit.wikimedia.org/r/500931 (owner: 10Jbond) [11:51:40] great minds think alike :-) [11:51:45] :) [11:51:57] (03PS2) 10Muehlenhoff: xhgui: Properly include passwords [puppet] - 10https://gerrit.wikimedia.org/r/500930 [11:52:40] moritzm: looks like this error has been around for a while, is it possible it's only started to trigger because of the apache upgrade [11:53:01] yeah, it was introduced on 5th of March [11:53:24] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494425/ [11:53:32] actually, 13th, when it was merged [11:53:57] (03CR) 10Muehlenhoff: [C: 03+2] xhgui: Properly include passwords [puppet] - 10https://gerrit.wikimedia.org/r/500930 (owner: 10Muehlenhoff) [11:55:17] moritzm: ahh the notify to apache just does a reload so it probably errored once and never reloaded its config until just now when it was restarted with the upgrade [11:55:56] ack, yes [11:56:12] w.wiki/$ -> donate.wikimedia.org [11:58:31] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:58:55] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:59:09] RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:59:21] Hello can someone deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/499530/ [11:59:35] RECOVERY - puppet last run on an-worker1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:59:39] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:59:43] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:59:59] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1200) [12:00:19] (03PS1) 10Muehlenhoff: Remove passwords::ldap::production from role::webperf::profiling_tools [puppet] - 10https://gerrit.wikimedia.org/r/500932 [12:00:58] Tulsi: the SWAT is over, we waited for you [12:01:01] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:01:15] be around in the time of deployment [12:01:37] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:01:57] RECOVERY - puppet last run on cloudvirtan1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:02:11] Amir1: :/ [12:02:21] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[apache2] [12:02:27] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:02:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500932 (owner: 10Muehlenhoff) [12:03:05] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:03:49] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:04:05] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:05:35] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:07:02] (03CR) 10Muehlenhoff: [C: 03+2] Remove passwords::ldap::production from role::webperf::profiling_tools [puppet] - 10https://gerrit.wikimedia.org/r/500932 (owner: 10Muehlenhoff) [12:09:19] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational [12:13:01] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:15:49] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:17:39] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:18:15] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational [12:18:43] thanks for the fix moritz [12:18:48] re: webperf xhgui [12:22:13] sure, yw :-) [12:23:45] (03CR) 10CDanis: [C: 03+1] uwsgi: allow setting routing rules [puppet] - 10https://gerrit.wikimedia.org/r/500729 (owner: 10Giuseppe Lavagetto) [12:25:00] (03CR) 10CDanis: [C: 03+1] graphite: correctly set Cache-control: no-store [puppet] - 10https://gerrit.wikimedia.org/r/500730 (owner: 10Giuseppe Lavagetto) [12:26:21] PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:55] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [12:31:15] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:31:18] (03PS1) 10Mathew.onipe: acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - 10https://gerrit.wikimedia.org/r/500940 (https://phabricator.wikimedia.org/T214921) [12:31:26] !log restarting gerrit service to apply change 498431 [12:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:46] gerrit is restarting.. be back momentarily [12:31:51] ok [12:33:39] onimisionipe: it's back. sorry about the interruption. tried to find a quiet window [12:33:54] mutante: Oh... no p! 
[12:34:45] (03PS2) 10Mathew.onipe: acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - 10https://gerrit.wikimedia.org/r/500940 (https://phabricator.wikimedia.org/T214921) [12:34:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: don't include clientpackages in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/500946 (https://phabricator.wikimedia.org/T219981) [12:35:01] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [12:35:25] ^ caused by gerrit restart. fixed by puppet. no issue [12:36:55] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [12:37:19] (03CR) 10CDanis: [C: 03+1] "LGTM with one nit" (031 comment) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500927 (owner: 10Volans) [12:37:57] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [12:38:11] running puppet on those too [12:38:31] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [12:38:37] does anyone else have the sense that in the past few weeks we're seeing a fair bit more puppet failures due to 'Catalog fetch fail'? [12:38:45] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [12:38:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: don't include clientpackages in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/500946 (https://phabricator.wikimedia.org/T219981) (owner: 10Arturo Borrero Gonzalez) [12:39:29] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [12:40:07] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [12:40:19] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:42:11] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:42:15] (03CR) 10Volans: check_icinga: add configuration validator (031 comment) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500927 (owner: 10Volans) [12:42:29] !log T219626 reimaging cloudcontrol2001-dev [12:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:33] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [12:43:10] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15528/aqs1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/500917 (owner: 10Elukey) [12:43:15] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:43:25] (03PS4) 10Elukey: profile::aqs: add the analytics contact group to aqs's alarms [puppet] - 10https://gerrit.wikimedia.org/r/500917 
[12:43:49] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:44:47] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:44:51] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:45:23] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:45:30] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [12:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:44] (03PS1) 10Elukey: aqs: better handling of contact groups [puppet] - 10https://gerrit.wikimedia.org/r/500949 [12:49:49] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [12:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:15] (03CR) 10Dzahn: "gerrit service restarted to apply this" [puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar) [12:51:35] (03PS2) 10Elukey: aqs: better handling of contact groups [puppet] - 10https://gerrit.wikimedia.org/r/500949 [12:52:49] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:55:50] (03PS1) 10Gehel: elasticsearch: wait for host up after reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/500952 [12:56:26] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic: Shortened URLs won't redirect when there's data - https://phabricator.wikimedia.org/T219986 (10Ladsgroup) I think there's something with https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/text-backend.inc.vcl.erb but my... [12:56:35] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:58:46] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:59:42] (03CR) 10DCausse: [C: 03+1] elasticsearch: wait for host up after reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/500952 (owner: 10Gehel) [13:00:02] (03CR) 10Gehel: [C: 03+2] elasticsearch: wait for host up after reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/500952 (owner: 10Gehel) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1300) [13:01:16] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [13:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:46] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:02:10] 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) p:05Triage→03Normal ` cat wikivoyage-old.org ; vim: set expandtab:smarttab @ 1D IN SOA n... [13:02:41] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [13:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15531/aqs1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/500949 (owner: 10Elukey) [13:03:11] (03CR) 10Elukey: [C: 03+2] aqs: better handling of contact groups [puppet] - 10https://gerrit.wikimedia.org/r/500949 (owner: 10Elukey) [13:03:16] 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. 
and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) [13:03:42] 10Operations: DNS for wikivoyage-old.org - https://phabricator.wikimedia.org/T81727 (10Dzahn) [13:04:15] 10Operations: wikivoyage migration (tracking) - https://phabricator.wikimedia.org/T81583 (10Dzahn) [13:04:40] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:05:12] making old tickets public that used to be NDA but for no reason except being RT and out of caution [13:05:33] that's why you might see something ancient pop up in feed [13:07:56] it doesn't mean they are new issues, it's for transparency and linkability.. and comes from ticket system before phab [13:08:32] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [13:09:40] 10Operations: SSL / protoproxy config for wikivoyage - https://phabricator.wikimedia.org/T81686 (10Dzahn) [13:10:19] 10Operations: SSL cert for wikivoyage.org - https://phabricator.wikimedia.org/T81588 (10Dzahn) [13:10:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: drop even more usage of clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/500956 (https://phabricator.wikimedia.org/T219981) [13:11:35] 10Operations: wikivoyage - import mysql dumps to s3 - https://phabricator.wikimedia.org/T81734 (10Dzahn) [13:12:29] 10Operations: redirect wikivoyage.de to wikivoyage.org with/after switch - https://phabricator.wikimedia.org/T81726 (10Dzahn) [13:12:51] 10Operations: create the new DNS zone file template for wikivoyage.org (wv "going live"-switch) - https://phabricator.wikimedia.org/T81569 (10Dzahn) [13:13:42] 10Operations: setup wikivoyage-lb - https://phabricator.wikimedia.org/T81555 (10Dzahn) [13:14:15] 10Operations, 10netops: IPv6 LVS service IPs (secure6) - https://phabricator.wikimedia.org/T81670 (10Dzahn) [13:16:31] (03CR) 10Arturo Borrero
Gonzalez: [C: 03+2] openstack: codfw1dev: drop even more usage of clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/500956 (https://phabricator.wikimedia.org/T219981) (owner: 10Arturo Borrero Gonzalez) [13:20:18] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:22:50] jynus: there is an alert on icinga about dbmonitor2001 and dbtree, could that be related to the earlier work with ferm? [13:23:06] 10Operations, 10Patch-For-Review: Ferm rules for dumps (ms1001/datasets) - https://phabricator.wikimedia.org/T105040 (10Dzahn) [13:23:17] 10Operations: Ferm rules for dumps (ms1001/datasets) - https://phabricator.wikimedia.org/T105040 (10Dzahn) [13:27:48] (03CR) 10Volans: [C: 03+2] check_icinga: add configuration validator [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500927 (owner: 10Volans) [13:27:50] marostegui: jynus : syntax error in apache config related to ssl [13:28:19] (03Merged) 10jenkins-bot: check_icinga: add configuration validator [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500927 (owner: 10Volans) [13:28:20] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:30:03] (03PS2) 10Gilles: Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) [13:30:18] (03CR) 10Gilles: Make caching of static performance site explicit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [13:31:17] (03CR) 10jerkins-bot: [V: 04-1] Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [13:33:26] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures [13:34:36] RECOVERY - Check systemd state on dbmonitor2001 is OK: OK - running: The system is fully operational [13:34:38] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 80589 bytes in 1.084 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [13:34:47] :) [13:34:49] !log reverting dbmonitor2001 to deb8u12+wmf1 build [13:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:14] (03PS3) 10Gilles: Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) [13:36:22] (03CR) 10jerkins-bot: [V: 04-1] Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [13:36:58] (03PS1) 10Volans: icinga: rename config validator and set exec bit [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500960 [13:37:27] (03CR) 10CDanis: [C: 03+1] icinga: rename config validator and set exec bit [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500960 (owner: 10Volans) [13:37:51] (03CR) 10Volans: [C: 03+2] icinga: rename config validator and set exec bit [software/external-monitoring] - 
10https://gerrit.wikimedia.org/r/500960 (owner: 10Volans) [13:38:15] (03Merged) 10jenkins-bot: icinga: rename config validator and set exec bit [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/500960 (owner: 10Volans) [13:38:29] (03PS4) 10Gilles: Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) [13:44:07] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.23/includes/media/MediaTransformOutput.php: T216499 Identify images that should have had high importance (duration: 00m 59s) [13:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:10] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499 [13:44:40] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:46:31] (03PS5) 10Gilles: Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) [13:46:32] !log restarting neutron-metadata-agent on cloudnet1003 [13:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:00] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:49:38] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:50:06] 10Operations, 10Release-Engineering-Team: mwdebug2001 "/" almost full - https://phabricator.wikimedia.org/T219989 (10Marostegui) [13:56:24] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [13:56:40] RECOVERY - puppet last run on dbmonitor2001 is 
OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:59:32] !log restarting neutron-l3-agent on cloudnet1003 and cloudnet1004 [13:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:47] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) It's a tough nut to crack, I've made progress on a number of issues, but still not fully done yet: 1. The failure quoted above is ultimately a bug in the C++ sta... [14:02:32] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:03:14] !log restarting rabbitmq on cloudcontrol1003 [14:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:38] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:15:36] (03PS1) 10Joal: Update sqoop number of processors to 10 [puppet] - 10https://gerrit.wikimedia.org/r/500964 [14:18:37] !log Stop replication on pc2007 for testing - T210725 [14:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:55] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [14:23:28] (03PS1) 10Elukey: cumin: add more hadoop-related aliases [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) [14:28:24] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Server rename: labtestnet2003 to cloudnet2003-dev, update label and switch ports descriptions, etc - https://phabricator.wikimedia.org/T219861 (10Papaul) 05Open→03Resolved complete [14:28:26] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to 
cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10Papaul) [14:30:48] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) >>! In T148843#5081277, @Miriam wrote: > Thanks @EBernha... [14:33:15] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (10Papaul) a:05Papaul→03Marostegui complete [14:33:56] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (10Marostegui) Thanks! ` physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, Rebuilding) ` [14:35:12] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:36:58] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:40:11] 10Operations, 10Phabricator, 10Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10ema) [14:45:38] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:50:29] (03CR) 10Elukey: [C: 03+2] Update sqoop number of processors to 10 [puppet] - 10https://gerrit.wikimedia.org/r/500964 (owner: 10Joal) [14:51:13] 10Operations, 10Parsoid, 10RESTBase, 10VisualEditor, and 5 others: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 (10Pchelolo) [14:51:28] PROBLEM - cache_text: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [5.0] 
https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [14:52:09] (03CR) 10Bstorm: [C: 03+1] wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [14:52:54] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:53:12] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:55:09] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.23/maintenance/includes/MigrateActors.php: Backporting fix from [[gerrit:500754]] (duration: 01m 01s) [14:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:33] checking varnishkafka [14:56:37] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.24/maintenance/includes/MigrateActors.php: Backporting fix from [[gerrit:500754]] (duration: 01m 01s) [14:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:20] (03CR) 10Bstorm: cloudstore: start refactor for role switch up around the labstores (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [14:58:45] (03PS1) 10Alex Monk: profile::cache::ssl::wikibase: Simplify [puppet] - 10https://gerrit.wikimedia.org/r/500973 [14:59:10] RECOVERY - cache_text: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [14:59:30] (03CR) 10jerkins-bot: [V: 04-1] profile::cache::ssl::wikibase: Simplify [puppet] - 10https://gerrit.wikimedia.org/r/500973 (owner: 10Alex 
Monk) [14:59:33] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on section 1 wikis for T215525 [14:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:36] T215525: log_search rows with ls_field='target_author_actor' and empty ls_value are created during actor migration - https://phabricator.wikimedia.org/T215525 [14:59:42] (03PS1) 10Gehel: elasticsearch: batch_sleep should be None, not 0.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/500974 [15:00:06] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on section 2 wikis for T215525 [15:00:06] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on remaining section 3 wikis for T215525 [15:00:06] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on section 4 wikis for T215525 [15:00:06] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on section 5 wikis for T215525 [15:00:06] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on section 6 wikis for T215525 [15:00:06] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on section 7 wikis for T215525 [15:00:06] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on section 8 wikis for T215525 [15:00:07] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on wikitech for T215525 [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:30] (03PS2) 10Alex Monk: profile::cache::ssl::wikibase: Simplify [puppet] - 10https://gerrit.wikimedia.org/r/500973 [15:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:06] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:03:49] 10Operations, 10Release-Engineering-Team: mwdebug2001 "/" almost full - https://phabricator.wikimedia.org/T219989 (10greg) [15:03:56] 10Operations, 10Release-Engineering-Team: mwdebug2001 "/" almost full - https://phabricator.wikimedia.org/T219989 (10thcipriani) This is related to T218783 [15:05:11] (03CR) 10Volans: [C: 03+1] "LGTM with a caveat" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/500974 (owner: 10Gehel) [15:05:12] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) Clang-4.0 is provided by security jessie/updates and have managed to get pbuilder working by adding the following ` deb http://security.debian.org/ jessie/updates main ` to:... [15:05:30] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) p:05Triage→03Normal [15:06:29] (03PS3) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [15:06:46] 10Operations, 10ops-eqiad, 10DBA: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (10Marostegui) @Cmjohnson let us know that the BBU arrived and he'll need to put the server down to be able to replace it. 
So we need to do a failover and failback to db1075 (the previous... [15:06:57] (03CR) 10Dzahn: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [15:07:00] RECOVERY - Device not healthy -SMART- on db2070 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2070&var-datasource=codfw+prometheus/ops [15:07:00] 10Operations, 10ops-eqiad, 10DBA: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (10Marostegui) a:05Cmjohnson→03Marostegui [15:07:12] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [15:08:20] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:08:48] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [15:09:12] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) [15:09:24] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) One other thing left to do here: replace optional parameters in the `/sys` hierarchy specs. 
[15:09:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Papaul This is done, except the problems with mounting point of the ssds, to be handled... [15:10:04] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:12:15] (03CR) 10Bstorm: "Will merge after Iefcc0a8ea51a3cddc0e79218809e14d97acfc186 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/500535 (https://phabricator.wikimedia.org/T219817) (owner: 10Bstorm) [15:13:19] (03PS1) 10Ladsgroup: Add mediawiki.org to the URL shortener whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500976 [15:13:21] (03CR) 10Bstorm: [C: 03+1] "Oh yes, we should do this one asap. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/500928 (https://phabricator.wikimedia.org/T219563) (owner: 10Ladsgroup) [15:14:15] (03PS4) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [15:14:17] (03CR) 10Vgutierrez: "looks good but please go a step further and get rid of check_ssl_unified_sni_letsencrypt_no_ocsp" [puppet] - 10https://gerrit.wikimedia.org/r/500973 (owner: 10Alex Monk) [15:14:41] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) >>! In T219803#5081880, @jbond wrote: > Clang-4.0 is provided by security jessie/updates and have managed to get pbuilder working by adding the following Ah, righ... 
[15:16:27] (03PS3) 10Dzahn: confd: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/500913 [15:16:36] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [15:16:36] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:42] (03CR) 10Dzahn: [C: 03+2] confd: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/500913 (owner: 10Dzahn) [15:18:12] !log shutdown ms-be2026 for firmware upgrade - T219854 [15:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:16] T219854: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 [15:18:41] (03CR) 10Krinkle: Enable UrlShortener in mediawikiwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [15:19:34] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:20:39] (03PS2) 10Gehel: elasticsearch: batch_sleep should be None, not 0.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/500974 [15:22:50] 10Operations, 10Packaging: Add security apt security suites to pbuilder base images - https://phabricator.wikimedia.org/T220003 (10jbond) [15:23:01] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) have created https://phabricator.wikimedia.org/T220003 [15:23:05] 10Operations, 10ops-eqiad, 10DBA: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) [15:23:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) dedicated dump slaves - 
https://phabricator.wikimedia.org/T219463 (10jcrespo) [15:23:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10jcrespo) [15:23:15] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) [15:23:18] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10mobrovac) p:05Triage→03Normal [15:26:02] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:26:05] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10mobrovac) p:05Triage→03Normal Since Graphoid currently does no... 
[15:26:15] (03CR) 10Volans: cumin: add more hadoop-related aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [15:27:14] (03CR) 10Elukey: cumin: add more hadoop-related aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [15:28:32] (03PS2) 10Elukey: cumin: add more hadoop-related aliases [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) [15:29:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: require apt-get update before moving on [puppet] - 10https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981) [15:30:11] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: require apt-get update before moving on [puppet] - 10https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981) (owner: 10Arturo Borrero Gonzalez) [15:30:46] 10Operations, 10Parsoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move parsoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219927 (10mobrovac) @fgiunchedi Since Parsoid is being moved over to PHP in the next 2 Qs, is there still point in moving the Node.js version over... [15:30:49] (03PS1) 10Dzahn: park wikivoyage-old.org [dns] - 10https://gerrit.wikimedia.org/r/500978 (https://phabricator.wikimedia.org/T219867) [15:31:05] 10Operations: netbox: User's groups not updated - https://phabricator.wikimedia.org/T220004 (10GTirloni) [15:31:38] (03PS2) 10Andrew Bogott: openstack::monitor::spreadcheck: Use a list of projects [puppet] - 10https://gerrit.wikimedia.org/r/500823 (owner: 10Alex Monk) [15:32:19] (03CR) 10Andrew Bogott: [C: 03+2] "Puppet compiler shows only test descriptions changing." 
[puppet] - 10https://gerrit.wikimedia.org/r/500823 (owner: 10Alex Monk) [15:34:44] (03CR) 10Bstorm: [C: 03+2] wikilabels: update alias for database [puppet] - 10https://gerrit.wikimedia.org/r/500928 (https://phabricator.wikimedia.org/T219563) (owner: 10Ladsgroup) [15:34:53] (03PS2) 10Bstorm: wikilabels: update alias for database [puppet] - 10https://gerrit.wikimedia.org/r/500928 (https://phabricator.wikimedia.org/T219563) (owner: 10Ladsgroup) [15:36:40] (03PS2) 10Andrew Bogott: openstack::monitor::spreadcheck: rm old renaming absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/500824 (owner: 10Alex Monk) [15:37:32] (03PS2) 10Andrew Bogott: openstack::monitor::spreadcheck: add cloudinfra config [puppet] - 10https://gerrit.wikimedia.org/r/500825 (owner: 10Alex Monk) [15:38:00] (03CR) 10Andrew Bogott: [C: 03+2] openstack::monitor::spreadcheck: rm old renaming absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/500824 (owner: 10Alex Monk) [15:38:28] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10LGoto) p:05Triage→03Normal [15:38:32] (03PS2) 10Alex Monk: sslcert: update-ocsp: Fix passing Host header in absence of proxy [puppet] - 10https://gerrit.wikimedia.org/r/500398 [15:38:33] (03PS4) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [15:38:43] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:39:01] (03PS1) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [15:39:06] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, 10service-runner, and 2 others: 
Move mobile apps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10LGoto) p:05Triage→03Normal [15:39:29] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:40:01] (03PS3) 10Alex Monk: profile::cache::ssl::wikibase: Simplify [puppet] - 10https://gerrit.wikimedia.org/r/500973 [15:40:13] (03CR) 10Andrew Bogott: [C: 03+2] openstack::monitor::spreadcheck: add cloudinfra config [puppet] - 10https://gerrit.wikimedia.org/r/500825 (owner: 10Alex Monk) [15:42:54] (03PS26) 10Andrew Bogott: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [15:44:06] (03PS2) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [15:45:33] (03CR) 10WMDE-Fisch: wikiba.se: add Apache rewrites for www to naked domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [15:46:21] (03CR) 10Andrew Bogott: "Latest diffs: https://puppet-compiler.wmflabs.org/compiler1001/15535/" [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [15:54:31] 10Operations, 10Release-Engineering-Team: mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10jcrespo) [15:56:11] 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Papaul) HP FlexFabric 10Gb 2port 534FLR-SFP+ Adapter 7.17.19 Embedded HPE Smart Storage Battery 1 Firmware 1.1 Embedded iLO 2.60 May 23 2018 System Board Intelligent Platform Abstraction Da... 
[15:57:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:58:07] PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: k8s-etcd,prometheus,static class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:58:57] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:59:07] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [15:59:31] (03CR) 10jerkins-bot: [V: 04-1] Turn on non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [16:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1600). [16:00:05] Zoranzoki21, Tulsi, and Pchelolo: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] here [16:00:19] \o [16:02:14] can anyone swat? all of #wikimedia-releng is in a meeting right now [16:03:11] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:07:34] I could do the deploys, but AFAIK some SWAT member should still be available on standby or something in case I really screw up [16:08:53] Lucas_WMDE: there are several of us on standby ;) [16:09:04] I'll keep an eye on this channel [16:09:10] okay, then I can do it [16:09:23] Lucas_WMDE: thank you! much appreciated [16:09:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "I checked that there’s no more namespace 104 in https://ar.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces, so I thin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500153 (https://phabricator.wikimedia.org/T217507) (owner: 10Zoranzoki21) [16:09:54] (03PS5) 10Lucas Werkmeister (WMDE): Remove namespace 104 from FlaggedRevs configuration for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500153 (https://phabricator.wikimedia.org/T217507) (owner: 10Zoranzoki21) [16:10:04] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) Looks like we may need to rebuild everything with stdc++[1] already tried leatherman and get a simlar errors pointing to relating to boost, hopefully we dont need to rebuild... [16:10:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500153 (https://phabricator.wikimedia.org/T217507) (owner: 10Zoranzoki21) [16:10:39] Oh, super. 
I am online :) [16:11:01] good, because I already started with your first change :) [16:11:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I think we can merge this for now to stop alerting so often, while trying to figure out the reason for this in https://phabricator.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/500839 (owner: 10CRusnov) [16:11:24] (03Merged) 10jenkins-bot: Remove namespace 104 from FlaggedRevs configuration for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500153 (https://phabricator.wikimedia.org/T217507) (owner: 10Zoranzoki21) [16:12:22] yes, I saw it now [16:12:23] Zoranzoki21: the first patch (ns104) should be on mwdebug1002, please test [16:12:30] I’ll review the next one in the meantime [16:12:51] Lucas_WMDE: ok, will do [16:13:33] so slow... [16:13:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add three domains at wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500154 (https://phabricator.wikimedia.org/T216886) (owner: 10Zoranzoki21) [16:13:57] (03PS3) 10Zoranzoki21: Add three domains at wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500154 (https://phabricator.wikimedia.org/T216886) [16:14:02] Hi [16:14:17] Good to go, check logs [16:14:21] okay [16:14:57] 500154? 
[16:15:22] that one’s next [16:15:26] the previous one isn’t done yet though [16:15:41] (03PS1) 10Zoranzoki21: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) [16:15:49] (03CR) 10Krinkle: "I cherry-picked this to deployment-puppetmaster03 and ran 'puppet agent -tv' on webperf11 but got this error:" [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [16:16:02] hi Tulsi btw :) [16:16:07] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: [[gerrit:500153|Remove namespace 104 from FlaggedRevs configuration for arwiki (T217507)]] (duration: 01m 00s) [16:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500154 (https://phabricator.wikimedia.org/T216886) (owner: 10Zoranzoki21) [16:16:17] T217507: FlaggedRevs still treats a removed namespace as if it still exists (arwiki) - https://phabricator.wikimedia.org/T217507 [16:16:20] oh ok [16:16:26] Hello Lucas_WMDE :-) [16:16:34] Please ping me when it's my turn. [16:16:41] will do, currently doing Zoranzoki21’s patches [16:17:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/500974 (owner: 10Gehel) [16:17:24] (03Merged) 10jenkins-bot: Add three domains at wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500154 (https://phabricator.wikimedia.org/T216886) (owner: 10Zoranzoki21) [16:17:33] Lucas_WMDE: Patch is public now, and works [16:17:35] (03CR) 10Gehel: [C: 03+2] elasticsearch: batch_sleep should be None, not 0.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/500974 (owner: 10Gehel) [16:17:47] And 500154 can be merged directly [16:18:05] it’s on mwdebug1002 now [16:18:09] can it be tested? 
[16:18:22] (03PS2) 10Zoranzoki21: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) [16:18:29] (03PS3) 10Zoranzoki21: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) [16:18:36] I’ll just check that at least it’s not breaking Commons [16:18:52] Lucas_WMDE: Yes, you can check it [16:18:56] (why is the debug server so slow? I don’t remember it being that bad) [16:18:57] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:19:12] Lucas_WMDE: I told it already previously [16:19:41] (03CR) 10jenkins-bot: Remove namespace 104 from FlaggedRevs configuration for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500153 (https://phabricator.wikimedia.org/T217507) (owner: 10Zoranzoki21) [16:19:43] (03CR) 10jenkins-bot: Add three domains at wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500154 (https://phabricator.wikimedia.org/T216886) (owner: 10Zoranzoki21) [16:20:29] (03CR) 10Acamicamacaraca: [C: 03+1] Enable Draft namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428) (owner: 10Zoranzoki21) [16:20:31] Lucas_WMDE: I loaded commons, everything works [16:20:41] okay [16:20:53] going ahead [16:21:25] Lucas_WMDE: I added https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/500987/ on deployments calendar too, can you do it after 500761? [16:21:58] (03PS2) 10Zoranzoki21: Enable Draft namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428) [16:22:37] is it okay if we do it at the end of the SWAT window, if there’s still time? 
[16:22:41] doesn’t look urgent from the task [16:22:52] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:500154|Add three domains at wgCopyUploadDomains (T216886, T219075)]] (duration: 01m 00s) [16:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:57] T216886: Please add uni-hamburg.de to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T216886 [16:22:58] T219075: Add bruun-rasmussen.dk to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T219075 [16:23:16] Lucas_WMDE: Ok [16:23:37] Now you can do adding namespace on srwiki [16:24:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM – NS 118 is still free on https://sr.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces and also matches the Draft " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428) (owner: 10Zoranzoki21) [16:25:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428) (owner: 10Zoranzoki21) [16:25:16] heh, it’s nice when I don’t even need to rebase the change, thank you :) [16:25:19] (03PS5) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [16:25:32] Lucas_WMDE: Your welcome [16:26:13] (03Merged) 10jenkins-bot: Enable Draft namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428) (owner: 10Zoranzoki21) [16:26:41] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [16:27:00] Lucas_WMDE: Is it ready for mwdebug? 
[16:27:05] it is now [16:27:17] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:27:18] Will check [16:27:22] seems to work in the API at least [16:27:55] Lucas_WMDE: Yes, I can confirm it. LGTM [16:27:59] alright, deploying [16:28:15] https://i.snag.gy/seOh5w.jpg [16:28:32] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10bmansurov) @Dzahn thanks. Turns out a Google group needs at least one member besides the admin. The only person who will use this mailing list is me for now.... [16:29:05] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:29:17] um [16:29:32] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:500761|Enable Draft namespace on srwiki (T214428)]] (duration: 01m 00s) [16:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:35] T214428: Enable Draft namespace on sr.wikipedia - https://phabricator.wikimedia.org/T214428 [16:29:37] T218796 says that the new idwiktionary namespace will require a maintenance script after deployment [16:29:37] T218796: Add namespace "Lampiran" at ID wiktionary - https://phabricator.wikimedia.org/T218796 [16:29:45] (03PS3) 10Elukey: cumin: add more hadoop-related aliases [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) [16:29:45] is that the case for the new srwiki Draft namespace too? [16:30:00] that would be a question for the SWAT folks, e. g. 
twentyafterfour :) [16:30:15] * Lucas_WMDE tries to find documentation in the meantime [16:30:18] Lucas_WMDE: No, for srwiki we no need it, because namespace is empty [16:30:46] Lucas_WMDE: For idwiktionary you need it because it contains articles already. [16:30:48] mwscript namespaceDupes.php --wiki=idwiktionary --fix [16:30:53] okay [16:31:58] Lucas_WMDE: srwiki is ok [16:32:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM – namespace 102 is still free according to https://id.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&formatver" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 (https://phabricator.wikimedia.org/T218796) (owner: 10Tulsi Bhagat) [16:32:41] (03PS3) 10Lucas Werkmeister (WMDE): Add namespace "Lampiran" at id.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 (https://phabricator.wikimedia.org/T218796) (owner: 10Tulsi Bhagat) [16:32:54] Tulsi: starting with your change now [16:33:00] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 (https://phabricator.wikimedia.org/T218796) (owner: 10Tulsi Bhagat) [16:33:07] Okay [16:33:51] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks happy, let's merge this tomorrow EU morning: https://puppet-compiler.wmflabs.org/compiler1002/15536/" [puppet] - 10https://gerrit.wikimedia.org/r/500973 (owner: 10Alex Monk) [16:34:06] (03Merged) 10jenkins-bot: Add namespace "Lampiran" at id.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 (https://phabricator.wikimedia.org/T218796) (owner: 10Tulsi Bhagat) [16:34:34] Tulsi: it’s on mwdebug1002, please test [16:34:41] Testing [16:35:16] Looks good [16:35:21] alright, deploying [16:35:33] OK [16:36:41] Pchelolo: heads up, your change is coming up soon [16:36:48] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:499530|Add namespace "Lampiran" at id.wiktionary (T218796)]] (duration: 00m 59s) 
[16:36:49] (I’m not yet done with idwiktionary though) [16:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:53] T218796: Add namespace "Lampiran" at ID wiktionary - https://phabricator.wikimedia.org/T218796 [16:36:55] thank you Lucas_WMDE, [16:37:17] (03CR) 10jenkins-bot: Enable Draft namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428) (owner: 10Zoranzoki21) [16:37:19] (03CR) 10jenkins-bot: Add namespace "Lampiran" at id.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 (https://phabricator.wikimedia.org/T218796) (owner: 10Tulsi Bhagat) [16:37:39] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript maintenance/namespaceDupes.php --wiki=idwiktionary --fix # T218796 – 41 links to fix, 41 were resolvable, Looks good! [16:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:09] okay that didn’t take long at all, phew [16:38:36] Lucas_WMDE: I think you should put output in comment at task [16:38:45] Lucas_WMDE: Some deployers always do it [16:38:46] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [16:38:50] Zoranzoki21: the full output? [16:39:19] Lucas_WMDE: Yes [16:39:33] hmm [16:39:48] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10Krinkle) a:05Krinkle→03MoritzMuehlenhoff OK. Looks like the image will alread... [16:39:53] (03CR) 10EBernhardson: [C: 03+1] Add 'depicts' statements to search index on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500080 (owner: 10Cparle) [16:40:15] what it fixed.... 
[16:40:17] I think on it [16:40:33] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:41:00] (03PS2) 10CRusnov: profile kubernetes node: Adjust latency alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/500839 [16:41:02] Lucas_WMDE: I means on this (as example) https://phabricator.wikimedia.org/T212100#4940931 [16:41:27] Zoranzoki21: done [16:41:34] looks good? https://phabricator.wikimedia.org/T218796#5082331 [16:41:43] Lucas_WMDE: Yes, it is [16:41:55] Thank you so much Lucas_WMDE, \o/ [16:42:01] okay, yay [16:42:08] ;) [16:42:11] going ahead with Pchelolo’s change now [16:42:16] thanks for the info Zoranzoki21 [16:42:35] Lucas_WMDE: yw [16:42:36] Lucas_WMDE: gimme a headsup when on mwdebug, I'll test [16:44:22] +2ed, let’s hope gate-and-submit doesn’t take too long [16:44:44] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [16:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:23] okay, its jenkins jobs are already done but now it’s waiting for other changes in the gate-and-submit pipeline :/ [16:45:45] sorry, nevermind, I was looking at the entirely wrong change [16:46:34] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) >>! In T219279#5079510, @Jdforrester-WMF wrote: >>>! In T219279#5068956, @Joe wro... [16:46:49] Lucas_WMDE: Can you do my throttle rule patch meanwhile? 
[16:48:00] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Add new throttle rule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) (owner: 10Zoranzoki21) [16:48:05] Zoranzoki21: reviewed [16:48:15] but I’d prefer to get the backport done first [16:48:17] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) Disclaimer: I do understand that SRE and others have been pretty busy last weeks, and I would absolutely take "we cannot reall... [16:48:32] (03PS4) 10Zoranzoki21: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) [16:48:38] (03PS5) 10Zoranzoki21: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) [16:48:42] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10RobH) 05Open→03Resolved robh@sodium:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online,... [16:48:45] Lucas_WMDE: Fixed [16:49:38] um [16:49:51] well, never mind [16:50:07] I’m not sure if the T0:00 in the Hackathon rule is intentional (midnight) or the same typo [16:50:12] but it doesn’t really matter at the moment I guess [16:50:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) (owner: 10Zoranzoki21) [16:50:45] Lucas_WMDE: It is not related to me [16:50:49] yeah [16:50:51] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:51:45] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [16:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:37] (03CR) 10CRusnov: [C: 03+2] profile kubernetes node: Adjust latency alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/500839 (owner: 10CRusnov) [16:52:47] Pchelolo: your change is on mwdebug1002 now, please test [16:53:09] Lucas_WMDE: doing so. will ping when done [16:53:23] ok [16:53:33] and wmf.23 doesn’t need a backport? [16:53:39] marxarelli: Are you running the train this week? I just made a cherry-pick that it would be really really nice to get into the train for wikitech -- https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/LdapAuthentication/+/500994/ [16:57:05] bd808: i am [16:57:24] * bd808 makes puppet dog eyes at marxarelli [16:57:32] *puppy [16:57:33] haha [16:57:37] lol [16:57:45] was going to say, puppet eyes are not effective on me :) [16:58:26] Lucas_WMDE: seems ok. 
and wmf.23 does not need a backport, the issue was introduced in wmf.24 [16:58:30] ok [16:58:33] going ahead [16:58:36] thanks [16:58:48] hopefully not this puppet dog -- https://www.amazon.com/Rotting-Zombie-Puppet-Halloween-Decoration/dp/B075FV1BRQ [16:59:24] lmao [16:59:51] Lucas_WMDE: I will move throttle in second SWAT windows as it is not more time anymore [16:59:57] *window [16:59:58] Zoranzoki21: I was about to do it [17:00:08] Lucas_WMDE: If you can do it, it will be great [17:00:09] I think it’d be okay to extend the SWAT by a few minutes [17:00:18] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/EventBus: SWAT: [[gerrit:500959|Incorrect order of calls in createPageDeleteEvent.]] (duration: 00m 59s) [17:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) (owner: 10Zoranzoki21) [17:00:26] can’t be tested anyways right? 
[17:00:27] Lucas_WMDE: I think so, becauuse throttles no needs mwdebug [17:00:32] well except for “it doesn’t break the site” [17:00:36] *because [17:00:45] It will not break [17:01:12] yeah I guess the canaries should be enough for that [17:01:13] bd808: i have no objections to deploying that but i'm not sure i can review/merge it [17:01:33] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) (owner: 10Zoranzoki21) [17:01:34] marxarelli: I already did the merge of it into master [17:01:36] (03CR) 10Zoranzoki21: Add new throttle rule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) (owner: 10Zoranzoki21) [17:01:47] (03CR) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500987 (https://phabricator.wikimedia.org/T220001) (owner: 10Zoranzoki21) [17:01:52] and it's exercised somewhere? (beta) [17:01:53] so this is just a backport rubberstamp and then the critical part is the deploy :) [17:02:16] marxarelli: local test instances by me and akosiaris [17:02:21] kk [17:02:31] we don't have a deployment-prep version of wikitech [17:02:43] word for me then :) [17:02:46] er, works [17:03:05] puppies, puppets, words, works! [17:03:07] Lucas_WMDE: Is it deployed? Can I go now? 
[17:03:24] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:500987|Add new throttle rule (T220001)]] (duration: 00m 58s) [17:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:27] T220001: Throttle Exception for Amnesty International edit-a-thon on April 14th - https://phabricator.wikimedia.org/T220001 [17:03:31] Zoranzoki21: it got done just now [17:03:35] should be fine now [17:03:40] Yep, thanks [17:03:42] Cya [17:03:45] * Zoranzoki21 waves [17:04:38] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript maintenance/namespaceDupes.php --wiki=srwiki --fix # T214428 – 0 pages to fix, 0 links to fix, Looks good! [17:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:42] T214428: Enable Draft namespace on sr.wikipedia - https://phabricator.wikimedia.org/T214428 [17:04:46] !log EU SWAT done [17:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:05] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:14:18] !log depooling kafka1001 to restart eventbus and kafka services for security updates [17:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:06] PROBLEM - Hadoop Namenode - Stand By on an-master1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:15:08] (03PS2) 10Arturo Borrero Gonzalez: openstack: serverpackages: require apt-get update before moving on [puppet] - 10https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981) [17:15:31] elukey: any work in progress? 
[17:15:35] that paged [17:15:36] nope [17:15:59] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: require apt-get update before moving on [puppet] - 10https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981) (owner: 10Arturo Borrero Gonzalez) [17:16:12] is that active or an standby? [17:16:46] standby, i am checking [17:16:56] thanks :) [17:17:18] (03PS1) 10Anomie: Set actor migration to read-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501000 (https://phabricator.wikimedia.org/T188327) [17:17:35] (03PS3) 10Arturo Borrero Gonzalez: openstack: serverpackages: require apt-get update before moving on [puppet] - 10https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981) [17:17:48] never seen this error before [17:17:55] * apergos peeks in [17:19:48] !log restart hadoop-hdfs-namenode on an-master1002 after forced shutdown due to errors [17:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:17] need to check why it paged, I thought we removed pages for hadoop [17:26:46] (03PS1) 10Hashar: jenkins: prevent reading IRC passwords [puppet] - 10https://gerrit.wikimedia.org/r/501001 (https://phabricator.wikimedia.org/T219991) [17:27:25] (03PS1) 10Elukey: profile::hadoop::master::standby: remove page to SRE [puppet] - 10https://gerrit.wikimedia.org/r/501002 [17:27:37] (03PS1) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [17:27:39] (03PS1) 10Jforrester: Invariant config cleanup: II - Account and anti-abuse settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501004 [17:27:41] (03PS1) 10Jforrester: Invariant config cleanup: III - SVG rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501005 [17:27:47] (03PS1) 10Jforrester: Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501006 [17:27:49] 
(03PS1) 10Jforrester: Invariant config cleanup: V - Notifications matters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501007 [17:27:51] (03PS1) 10Jforrester: Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501008 [17:27:53] (03PS1) 10Jforrester: Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501009 [17:27:55] (03PS1) 10Jforrester: Invariant config cleanup: VIII - ULS logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501010 [17:27:57] (03PS1) 10Jforrester: Invariant config cleanup: IX - RightsIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501011 [17:27:59] (03PS1) 10Jforrester: Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501012 [17:28:44] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: II - Account and anti-abuse settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501004 (owner: 10Jforrester) [17:28:46] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 (owner: 10Jforrester) [17:28:48] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: III - SVG rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501005 (owner: 10Jforrester) [17:29:04] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501006 (owner: 10Jforrester) [17:29:10] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:29:12] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: V - Notifications matters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501007 (owner: 10Jforrester) [17:29:22] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master::standby: remove page to SRE [puppet] - 10https://gerrit.wikimedia.org/r/501002 (owner: 10Elukey) [17:29:36] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501008 (owner: 10Jforrester) [17:30:00] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501009 (owner: 10Jforrester) [17:30:40] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: VIII - ULS logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501010 (owner: 10Jforrester) [17:30:45] (03PS1) 10Hashar: jenkins: ensure secrets are only readable by jenkins [puppet] - 10https://gerrit.wikimedia.org/r/501013 [17:31:18] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: IX - RightsIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501011 (owner: 10Jforrester) [17:31:29] (03PS1) 10Thcipriani: Gerrit 2.15.12 (update core only) [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/501014 [17:31:59] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501012 (owner: 10Jforrester) [17:32:14] (03PS2) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [17:32:16] (03PS2) 10Jforrester: Invariant config cleanup: II - Account and anti-abuse settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501004 [17:32:18] (03PS2) 10Jforrester: Invariant config cleanup: III - SVG rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501005 [17:32:20] (03PS2) 
10Jforrester: Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501006 [17:32:22] (03PS2) 10Jforrester: Invariant config cleanup: V - Notifications matters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501007 [17:32:24] (03PS2) 10Jforrester: Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501008 [17:32:26] (03PS2) 10Jforrester: Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501009 [17:32:28] (03PS2) 10Jforrester: Invariant config cleanup: VIII - ULS logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501010 [17:32:33] (03PS2) 10Jforrester: Invariant config cleanup: IX - RightsIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501011 [17:32:35] (03PS2) 10Jforrester: Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501012 [17:33:26] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: III - SVG rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501005 (owner: 10Jforrester) [17:33:31] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501006 (owner: 10Jforrester) [17:33:53] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: V - Notifications matters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501007 (owner: 10Jforrester) [17:34:11] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501008 (owner: 10Jforrester) [17:34:59] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501009 (owner: 10Jforrester) [17:35:15] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: VIII - ULS logging [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/501010 (owner: 10Jforrester) [17:35:49] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: IX - RightsIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501011 (owner: 10Jforrester) [17:36:32] (03CR) 10jerkins-bot: [V: 04-1] Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501012 (owner: 10Jforrester) [17:37:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Makes sense. Still doesn't cover the fact any program executed as the jenkins user can read that, which looks to me like the biggest issue" [puppet] - 10https://gerrit.wikimedia.org/r/501001 (https://phabricator.wikimedia.org/T219991) (owner: 10Hashar) [17:37:42] (03CR) 10Hashar: [C: 04-1] "That is apparently done by the Debian package (per Joe)" [puppet] - 10https://gerrit.wikimedia.org/r/501013 (owner: 10Hashar) [17:39:52] RECOVERY - Hadoop Namenode - Stand By on an-master1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:40:03] thanks elukey ! [17:40:15] there was a job hammering the namenodes :( [17:40:20] we just killed it [17:41:04] huh [17:41:18] what was it? [17:41:56] it seems a job that creates a ton of files on hdfs [17:42:02] from a researcher [17:42:09] we are going to follow up with him [17:42:13] I have little context [17:43:09] (03CR) 10Hashar: [C: 03+1] "So the Debian package 'postinst' does not set the adm group." 
[puppet] - 10https://gerrit.wikimedia.org/r/501013 (owner: 10Hashar) [17:44:45] !log shortly postponing restarts of eventbus and kafka services for security updates due to unrelated firefighting - repooling kafka1001 [17:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:38] (03PS10) 10Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 [17:45:40] (03PS1) 10Giuseppe Lavagetto: Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 [17:46:20] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:46:58] (03CR) 10jerkins-bot: [V: 04-1] Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [17:47:37] (03PS2) 10Herron: jenkins: prevent reading IRC passwords [puppet] - 10https://gerrit.wikimedia.org/r/501001 (https://phabricator.wikimedia.org/T219991) (owner: 10Hashar) [17:47:41] (03PS2) 10Hashar: jenkins: ensure secrets and logs are only readable by jenkins [puppet] - 10https://gerrit.wikimedia.org/r/501013 [17:48:36] (03PS9) 10Andrew Bogott: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844 (owner: 10Alex Monk) [17:48:46] (03CR) 10Herron: [C: 03+2] jenkins: prevent reading IRC passwords [puppet] - 10https://gerrit.wikimedia.org/r/501001 (https://phabricator.wikimedia.org/T219991) (owner: 10Hashar) [17:50:42] (03PS10) 10Andrew Bogott: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844 (owner: 10Alex Monk) [17:52:18] (03CR) 10Andrew Bogott: [C: 03+2] labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 
10https://gerrit.wikimedia.org/r/500844 (owner: 10Alex Monk) [17:52:48] 10Operations, 10Performance-Team, 10Traffic: Some load.php requests failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (10Krinkle) [17:53:47] (03CR) 10Gilles: "Ok,yeah, I'll have to figure out another way to get the headers apache mod installed." [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [17:54:56] (03CR) 10Paladox: [C: 03+2] Gerrit 2.15.12 (update core only) [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/501014 (owner: 10Thcipriani) [17:56:48] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [17:57:05] !log restart hadoop-hdfs-namenode on an-master1001 as precautionary measure after the outage (currently standby) [17:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1800) [18:00:04] thcipriani, brennen, and paladox: #bothumor My software never has bugs. It just develops random features. Rise for Gerrit Core 2.15.12 Upgrade. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1800). [18:00:32] * paladox here [18:00:33] * thcipriani does [18:01:22] (03PS3) 10Herron: jenkins: ensure secrets and logs are only readable by jenkins [puppet] - 10https://gerrit.wikimedia.org/r/501013 (owner: 10Hashar) [18:03:19] (03CR) 10Herron: [C: 03+2] jenkins: ensure secrets and logs are only readable by jenkins [puppet] - 10https://gerrit.wikimedia.org/r/501013 (owner: 10Hashar) [18:04:18] (03CR) 10Thcipriani: [V: 03+2] Gerrit 2.15.12 (update core only) [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/501014 (owner: 10Thcipriani) [18:04:37] thcipriani: can you old the upgrade just for a few minutes? 
;D [18:04:57] I got some patching for jenkins going on :] [18:05:31] hashar: sure ping me when I'm clear [18:05:58] we are running puppet on the hosts [18:07:24] thcipriani: all set thanks [18:07:41] hashar: cool, going ahead with the update now [18:07:46] good luck! [18:08:59] thanks :) [18:09:26] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@e416edf]: Gerrit to 2.15.12 on gerrit2001 only [18:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:37] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@e416edf]: Gerrit to 2.15.12 on gerrit2001 only (duration: 00m 11s) [18:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:46] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@e416edf]: Gerrit to 2.15.12 on cobalt (restart to follow) [18:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:57] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@e416edf]: Gerrit to 2.15.12 on cobalt (restart to follow) (duration: 00m 11s) [18:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:20] !log restarting gerrit for 2.15.12 update [18:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:52] !log gerrit back on 2.15.12 [18:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:12] Yay. [18:15:20] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [18:17:57] (03CR) 10Lucas Werkmeister (WMDE): Enable UrlShortener in mediawikiwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [18:18:04] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:19:14] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [18:23:13] thcipriani: Hmm. Phab isn't letting me create a task via https://phabricator.wikimedia.org/maniphest/task/edit/form/3/ without explicitly setting a number of points. Did we change something? It used to work fine. [18:26:20] James_F: hrm, I'm unware of any updates to phab that have happened recently (although twentyafterfour mmay know better). Krinkle updated the form on the 28th, but I can't figure out what changes were made looking at the ui. [18:26:52] (03PS4) 10Elukey: cumin: add more hadoop-related aliases [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) [18:26:58] James_F: no updates to phab that I'm aware of either.. I'll look at it [18:27:10] thcipriani: Not urgent at all, just confusing. [18:27:51] I changed the label from "Create Task (Advanced)" to "New Task (Advanced)". [18:28:18] I don't see a configuratin option to declare whether a field should be required to be non-empty or not. [18:28:29] That might've been changed in upstream code unintentionally or in our extension of it. 
[18:28:34] (03CR) 10Elukey: [C: 03+2] cumin: add more hadoop-related aliases [puppet] - 10https://gerrit.wikimedia.org/r/500967 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [18:29:37] * James_F nods. [18:42:16] (03PS2) 10Bstorm: sonofgridengine: make tools-checker hosts submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/500535 (https://phabricator.wikimedia.org/T219817) [18:43:56] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:45:08] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:46:16] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:47:06] (03PS3) 10Jforrester: Invariant config cleanup: III - SVG rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501005 [18:47:08] (03PS3) 10Jforrester: Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501006 [18:47:10] (03PS3) 10Jforrester: Invariant config cleanup: V - Notifications matters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501007 [18:47:12] (03PS3) 10Jforrester: Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501008 [18:47:14] (03PS3) 10Jforrester: Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501009 [18:47:16] (03PS3) 10Jforrester: Invariant config cleanup: VIII - ULS logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501010 [18:47:18] (03PS3) 10Jforrester: Invariant config cleanup: IX - RightsIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501011 [18:47:20] (03PS3) 10Jforrester: Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501012 [18:50:34] 10Operations, 10cloud-services-team, 
10serviceops, 10Core Platform Team Backlog (Watching / External), and 2 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) [18:50:38] (03PS27) 10Andrew Bogott: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [18:52:24] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [18:55:13] (03CR) 10Jbond: tests: mark test strings with escape as raw (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/483131 (owner: 10Volans) [18:55:40] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Backlog (Watching / External), and 2 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) [18:57:12] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:57:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [18:59:00] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:00:04] marxarelli: How many deployers does it take to do MediaWiki train - Americas version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T1900). 
[19:02:44] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [19:03:55] (PS3) Bstorm: sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500535 (https://phabricator.wikimedia.org/T219817) [19:05:25] (CR) Bstorm: [C: +2] sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500535 (https://phabricator.wikimedia.org/T219817) (owner: Bstorm) [19:08:35] (CR) Jforrester: "Found via `'(wm?g)(.*)' => \[\n\t'default' => ([^\n,]*),?( ([^\n]*))?\n\],`, with judgement as to which vary from time to time and which d" [mediawiki-config] - https://gerrit.wikimedia.org/r/501003 (owner: Jforrester) [19:09:40] James_F: I set the points field to default=0 ... maybe that helps? [19:10:39] well sort of helps: https://phabricator.wikimedia.org/T220027 [19:10:53] only drawback is that shows "0 story points" at the top for no good reason [19:11:10] Yeah, difference between 0 and null… [19:11:33] This is probably an upstream change we didn't notice until now.
[19:16:32] (03PS1) 10Ladsgroup: varnish: allow short urls that have query [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) [19:19:50] (03PS1) 10Hashar: contint: deny some Jenkins entrypoint [puppet] - 10https://gerrit.wikimedia.org/r/501033 (https://phabricator.wikimedia.org/T219991) [19:19:59] (03CR) 10Ladsgroup: varnish: allow short urls that have query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [19:20:51] (03CR) 10Andrew Bogott: [C: 03+2] contint: deny some Jenkins entrypoint [puppet] - 10https://gerrit.wikimedia.org/r/501033 (https://phabricator.wikimedia.org/T219991) (owner: 10Hashar) [19:21:52] (03PS1) 10Bstorm: toolschecker: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/501034 (https://phabricator.wikimedia.org/T219243) [19:22:20] (03PS2) 10Bstorm: toolschecker: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/501034 (https://phabricator.wikimedia.org/T219243) [19:23:22] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@50b2af9]: Deploy new Updater for more cache-friendly update startegy [19:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:57] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Backlog (Watching / External), and 3 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) [19:24:36] (03CR) 10Bstorm: [C: 03+2] toolschecker: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/501034 (https://phabricator.wikimedia.org/T219243) (owner: 10Bstorm) [19:25:14] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:26:10] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [19:28:40] (03CR) 10Andrew Bogott: [C: 04-1] "This needs a change in the puppet compiler client code to search subdirs" [puppet] - 10https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott) [19:34:17] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@50b2af9]: Deploy new Updater for more cache-friendly update startegy (duration: 10m 54s) [19:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:19] (03PS10) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [19:36:00] (03PS11) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [19:39:06] (03PS1) 10Andrew Bogott: utils.facts_file: do a recursive search in the 'facts' dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) [19:39:50] (03CR) 10jerkins-bot: [V: 04-1] utils.facts_file: do a recursive search in the 'facts' dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott) [19:44:59] (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501042 [19:45:01] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501042 (owner: 10Dduvall) [19:46:16] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501042 (owner: 10Dduvall) [19:46:50] bd808: rolling the train, fyi [19:48:19] marxarelli: thanks for the heads up. 
We haven't flipped the feature flag on for that patch yet, so hopefully it will be a very boring roll out :) [19:48:35] ah, good to know [19:49:25] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24 [19:50:29] dduvall@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:50:39] (03PS1) 10BryanDavis: sudo: Allow root to assume any group [puppet] - 10https://gerrit.wikimedia.org/r/501043 [19:50:48] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24 [19:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:15] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.24 (duration: 01m 49s) [19:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:24] PROBLEM - HHVM rendering on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:52:11] (03PS1) 10Dmaza: Enable Partial Blocks on French and Polish wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) [19:52:16] PROBLEM - Nginx local proxy to apache on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:52:18] uh ohs [19:52:24] PROBLEM - Apache HTTP on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:53:02] !log massive spike in DBTransactionError ([{exception_id}] {exception_url} Wikimedia\Rdbms\DBTransactionError from line 246 of /srv/mediawiki/php-1.33.0-wmf.24/includes/libs/rdbms/lbfactory/LBFactory.php: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started.) 
[19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:16] !log rolling back group1 [19:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:01] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501042 (owner: 10Dduvall) [19:54:17] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Alert "kubelet operational latencies" - https://phabricator.wikimedia.org/T219696 (10akosiaris) 05Open→03Resolved Culprit identified. On `Thu Mar 28 15:07:55 2019` a new version of the eventgate-analytics chart was deployed to both codfw and eqiad. That new... [19:55:26] !log 111,185 and counting DBTransactionError for jobrunner.discovery.wmnet [19:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:37] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Miriam) OK, I can prepare a task for this, or we can start from someth... [19:56:00] RECOVERY - Nginx local proxy to apache on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:56:06] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:56:14] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Revert group1 to 1.33.0-wmf.24 [19:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:22] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 74924 bytes in 0.493 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:56:40] (03CR) 10Aezell: [C: 03+1] "These are the ones." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) (owner: 10Dmaza) [19:56:51] !log log correction group1 reverted to 1.33.0-wmf.23 [19:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:59] (03PS13) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [19:59:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T2000). [20:09:18] !log 1.33.0-wmf.24 is holding at group0 following rollback. filed T220037. 
cc: T206678 [20:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:28] T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678 [20:09:29] T220037: Spike in DBTransactionError following 1.33.0-wmf.24 group1 promotion - https://phabricator.wikimedia.org/T220037 [20:11:22] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:14:27] !log arlolra@deploy1001 Started deploy [parsoid/deploy@4f740e3]: Updating Parsoid to 0b3bb10 [20:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:20] (03PS2) 10Giuseppe Lavagetto: Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 [20:18:27] (03CR) 10jerkins-bot: [V: 04-1] Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [20:19:20] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) I wonder if we can use fashion mist to benchmark: https://resea... 
[20:20:11] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@4f740e3]: Updating Parsoid to 0b3bb10 (duration: 05m 44s) [20:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:30] PROBLEM - MD RAID on elastic2048 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [20:23:34] ACKNOWLEDGEMENT - MD RAID on elastic2048 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220038 [20:23:46] Operations, ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (ops-monitoring-bot) [20:24:24] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:26:20] (CR) Smalyshev: [C: +1] "Deployed." [puppet] - https://gerrit.wikimedia.org/r/500359 (https://phabricator.wikimedia.org/T217897) (owner: Smalyshev) [20:29:57] !log Updated Parsoid to 0b3bb10 (T219337) [20:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:00] T219337: Port Parsoid tokenizer to PHP - https://phabricator.wikimedia.org/T219337 [20:30:50] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational [20:34:42] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:36:22] (03PS4) 10KartikMistry: Add publish restrictions config for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) (owner: 10Petar.petkovic) [20:36:59] * gehel is looking into elastic2048 [20:37:48] (03PS14) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [20:39:31] ACKNOWLEDGEMENT - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel raid failed, disk read only - https://phabricator.wikimedia.org/T220038 [20:40:39] !log excluding elastic2048 from cluster and depooling - T220038 [20:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:42] T220038: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 [20:43:54] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Gehel) Node is depooled and excluded from the cluster. @Papaul if you have a spare, feel free to do what needs doing. Ping me when done and I'll reimage. [20:45:37] (03PS5) 10Gehel: Enable using revision-fetch mechanism for test & internal clusters [puppet] - 10https://gerrit.wikimedia.org/r/500359 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev) [20:47:08] (03CR) 10Gehel: [C: 03+2] Enable using revision-fetch mechanism for test & internal clusters [puppet] - 10https://gerrit.wikimedia.org/r/500359 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev) [20:47:35] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10tstarling) Excluding the ligatures, since I think they are correct already in HHVM, th... 
[20:53:03] (03PS1) 10Gehel: Revert "Enable using revision-fetch mechanism for test & internal clusters" [puppet] - 10https://gerrit.wikimedia.org/r/501054 [20:54:15] (03CR) 10Gehel: [C: 03+2] Revert "Enable using revision-fetch mechanism for test & internal clusters" [puppet] - 10https://gerrit.wikimedia.org/r/501054 (owner: 10Gehel) [20:54:35] (03PS15) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [20:56:46] PROBLEM - Hadoop Namenode - Stand By on an-master1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:58:19] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) both rapidjson and catch build find using libc++ however even using theses packages we get the above error [20:59:26] (03PS16) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [21:02:14] (03PS1) 10Gehel: wdqs: expose revision-fetch mechanism [puppet] - 10https://gerrit.wikimedia.org/r/501056 (https://phabricator.wikimedia.org/T217897) [21:04:17] (03CR) 10Smalyshev: [C: 03+1] wdqs: expose revision-fetch mechanism [puppet] - 10https://gerrit.wikimedia.org/r/501056 (https://phabricator.wikimedia.org/T217897) (owner: 10Gehel) [21:07:33] (03CR) 10Gehel: [C: 03+2] wdqs: expose revision-fetch mechanism [puppet] - 10https://gerrit.wikimedia.org/r/501056 (https://phabricator.wikimedia.org/T217897) (owner: 10Gehel) [21:14:08] checking an-master 1002 [21:14:13] (03PS17) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [21:16:59] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/15553/dns1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [21:19:00] (03CR) 
10Andrew Bogott: [C: 04-1] "We also need to pass in a host-specific yamldir to the puppet master itself" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott) [21:21:12] RECOVERY - Hadoop Namenode - Stand By on an-master1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:21:47] (03PS6) 10Gilles: Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) [21:22:29] (03PS9) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) [21:22:56] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10tstarling) I see now that the ligatures are indeed changing, but there is only one aff... 
[21:22:58] (03CR) 10jerkins-bot: [V: 04-1] Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [21:24:23] (03PS7) 10Gilles: Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) [21:25:36] (03CR) 10Bstorm: [C: 03+2] cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:32:03] !log start hadoop-hdfs-namenode on an-master1002 after outage due to big job hitting HDFS [21:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:37] (03PS1) 10Bstorm: labstore: fix mistake in maintain_dbusers service [puppet] - 10https://gerrit.wikimedia.org/r/501066 (https://phabricator.wikimedia.org/T209527) [21:50:41] (03CR) 10Bstorm: [C: 03+2] labstore: fix mistake in maintain_dbusers service [puppet] - 10https://gerrit.wikimedia.org/r/501066 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:55:03] (03PS1) 10Elukey: admin: temporary remove piccardi from analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/501067 [21:55:30] cdanis: --^ [21:55:55] (03CR) 10CDanis: [C: 03+1] admin: temporary remove piccardi from analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/501067 (owner: 10Elukey) [21:57:09] thanks! [21:57:16] np! 
[21:57:49] (03CR) 10Elukey: [C: 03+2] admin: temporary remove piccardi from analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/501067 (owner: 10Elukey) [22:14:13] (03PS1) 10Bstorm: labstore: cleanup the remaining files after Icc89332f0e779 [puppet] - 10https://gerrit.wikimedia.org/r/501070 (https://phabricator.wikimedia.org/T209527) [22:29:52] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:41:47] (03PS1) 10Catrope: Set GrowthExperiments homepage config for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501079 [22:45:39] (03PS1) 10Volans: icinga: sync only if config is valid and log it [puppet] - 10https://gerrit.wikimedia.org/r/501083 [22:45:52] (03PS1) 10Catrope: GrowthExperiments Homepage: configure tutorial pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501084 (https://phabricator.wikimedia.org/T219395) [22:48:00] (03PS1) 10Catrope: Enable Flow and Flow beta feature on zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501086 (https://phabricator.wikimedia.org/T219588) [22:48:37] (03CR) 10Volans: "Compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/501083 (owner: 10Volans) [22:51:57] (03CR) 10CDanis: [C: 03+1] icinga: sync only if config is valid and log it [puppet] - 10https://gerrit.wikimedia.org/r/501083 (owner: 10Volans) [22:54:19] (03PS1) 10Catrope: GrowthExperiments: Enable homepage instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501091 [22:55:47] (03PS1) 10Catrope: Beta cluster: Enable GrowthExperiments homepage for 50% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501093 [22:56:58] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [22:57:04] (03PS3) 10CRusnov: Break report into 3 parts and adjust the way devices are filtered [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 
[22:57:49] (03CR) 10Catrope: [C: 03+2] Beta cluster: Enable GrowthExperiments homepage for 50% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501093 (owner: 10Catrope) [22:58:29] (03CR) 10Gergő Tisza: [C: 03+1] Add cron job to update WikimediaEditorTasks suggestions table [puppet] - 10https://gerrit.wikimedia.org/r/500104 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [22:59:01] (03Merged) 10jenkins-bot: Beta cluster: Enable GrowthExperiments homepage for 50% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501093 (owner: 10Catrope) [22:59:21] (03PS18) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190403T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:26] I'll SWAT since I'm the only customer [23:00:38] (CR) Catrope: [C: +2] Set GrowthExperiments homepage config for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/501079 (owner: Catrope) [23:01:55] (Merged) jenkins-bot: Set GrowthExperiments homepage config for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/501079 (owner: Catrope) [23:03:03] (PS4) CRusnov: Break report into 3 parts and adjust the way devices are filtered [software/netbox-reports] - https://gerrit.wikimedia.org/r/499245 [23:04:15] (CR) jenkins-bot: Beta cluster: Enable GrowthExperiments homepage for 50% of new accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/501093 (owner: Catrope) [23:04:17] (CR) jenkins-bot: Set GrowthExperiments homepage config for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/501079 (owner: Catrope) [23:05:59] marxarelli: In the future please push train reverts to Gerrit and merge them rather than leaving them as local patches on the deployment host [23:06:54] RoanKattouw: ah, my mistake [23:07:05] (PS1) Catrope: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/501100 [23:07:05] did you submit the revert already?
[23:07:12] there it is [23:07:12] Just did [23:07:14] sorry about that [23:07:20] (03CR) 10Catrope: [C: 03+2] Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501100 (owner: 10Catrope) [23:08:21] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501100 (owner: 10Catrope) [23:09:06] No worries, I was thrown off by git pull trying to do a merge commit at first, but once I figured out it was just a train revert it was easy to recover from [23:12:30] (03PS1) 10CRusnov: puppetdb_microservice: Redo how it returns values [puppet] - 10https://gerrit.wikimedia.org/r/501104 [23:13:37] (03CR) 10jerkins-bot: [V: 04-1] puppetdb_microservice: Redo how it returns values [puppet] - 10https://gerrit.wikimedia.org/r/501104 (owner: 10CRusnov) [23:14:28] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Configure GrowthExperiments homepage on testwiki (duration: 01m 01s) [23:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:17] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501100 (owner: 10Catrope) [23:15:39] (03PS2) 10CRusnov: puppetdb_microservice: Redo how it returns values [puppet] - 10https://gerrit.wikimedia.org/r/501104 [23:16:18] (03CR) 10Catrope: [C: 03+2] GrowthExperiments Homepage: configure tutorial pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501084 (https://phabricator.wikimedia.org/T219395) (owner: 10Catrope) [23:16:35] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments Homepage: configure tutorial pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501084 (https://phabricator.wikimedia.org/T219395) (owner: 10Catrope) [23:16:54] (03PS2) 10Catrope: GrowthExperiments Homepage: configure tutorial pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501084 (https://phabricator.wikimedia.org/T219395) 
[23:17:02] (CR) Catrope: [C: +2] GrowthExperiments Homepage: configure tutorial pages [mediawiki-config] - https://gerrit.wikimedia.org/r/501084 (https://phabricator.wikimedia.org/T219395) (owner: Catrope)
[23:18:12] (Merged) jenkins-bot: GrowthExperiments Homepage: configure tutorial pages [mediawiki-config] - https://gerrit.wikimedia.org/r/501084 (https://phabricator.wikimedia.org/T219395) (owner: Catrope)
[23:18:33] !log catrope@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 00s)
[23:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:42] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Configure GrowthExperiments homepage tutorial pages on cswiki, kowiki, viwiki (dark deploy) (duration: 00m 59s)
[23:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:02] (PS2) Catrope: GrowthExperiments: Enable homepage instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/501091
[23:22:08] (CR) Catrope: [C: +2] GrowthExperiments: Enable homepage instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/501091 (owner: Catrope)
[23:23:53] (Merged) jenkins-bot: GrowthExperiments: Enable homepage instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/501091 (owner: Catrope)
[23:26:00] (PS1) Catrope: Fix missing wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/501110
[23:26:05] (PS5) CRusnov: Break report into 3 parts and adjust the way devices are filtered [software/netbox-reports] - https://gerrit.wikimedia.org/r/499245
[23:26:25] (CR) jenkins-bot: GrowthExperiments Homepage: configure tutorial pages [mediawiki-config] - https://gerrit.wikimedia.org/r/501084 (https://phabricator.wikimedia.org/T219395) (owner: Catrope)
[23:26:27] (CR) jenkins-bot: GrowthExperiments: Enable homepage instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/501091 (owner: Catrope)
[23:28:16] (PS6) CRusnov: Break report into 3 parts and adjust the way devices are filtered [software/netbox-reports] - https://gerrit.wikimedia.org/r/499245
[23:28:26] (CR) Catrope: [C: +2] Fix missing wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/501110 (owner: Catrope)
[23:29:32] (Merged) jenkins-bot: Fix missing wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/501110 (owner: Catrope)
[23:37:44] (CR) jenkins-bot: Fix missing wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/501110 (owner: Catrope)
[23:38:01] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments homepage EventLogging on testwiki (duration: 00m 59s)
[23:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:19] (PS7) CRusnov: Break report into parts and adjust the way devices are filtered [software/netbox-reports] - https://gerrit.wikimedia.org/r/499245
[23:39:19] (PS2) Catrope: Enable Flow and Flow beta feature on zhwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/501086 (https://phabricator.wikimedia.org/T219588)
[23:39:25] (CR) Catrope: [C: +2] Enable Flow and Flow beta feature on zhwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/501086 (https://phabricator.wikimedia.org/T219588) (owner: Catrope)
[23:41:11] (Merged) jenkins-bot: Enable Flow and Flow beta feature on zhwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/501086 (https://phabricator.wikimedia.org/T219588) (owner: Catrope)
[23:41:41] TimStarling: It looks like you're the only one using mwscript on deploy1001 right now, so I think it might be you that just triggered 15k error log entries of the form ErrorException from line 0 of : PHP Warning: Class WikiPageMessageGroup has no unserializer ?
[23:42:56] https://logstash.wikimedia.org/goto/2216d523464e3fce980768445c77b13b
[23:44:33] no, I don't think so
[23:44:39] (CR) Jforrester: "Pinging people who like these kinds of changes. ;-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/501003 (owner: Jforrester)
[23:46:07] I had an sql terminal open, not sure how that could cause a flood of messages
[23:46:13] wasn't doing anything with it
[23:48:50] (CR) jenkins-bot: Enable Flow and Flow beta feature on zhwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/501086 (https://phabricator.wikimedia.org/T219588) (owner: Catrope)
[23:50:41] !log catrope@deploy1001 Synchronized dblists/flow.dblist: Enable Flow on zhwikisource (T219588) (duration: 00m 57s)
[23:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:45] T219588: Enable "Structured Discussion" for zh.wikisource.org - https://phabricator.wikimedia.org/T219588
[23:50:54] it was open untouched since about 21:19, hard to see how it could cause errors at 23:37
[23:51:40] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Flow beta feature on zhwikisource (T219588) (duration: 00m 58s)
[23:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:42] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) Looks like we have problem with redirects - they can not be fetched by-revision. E.g.: https://www.wi...