[00:00:08] oh okay, anyone from releng will be around? [00:00:19] probably not? [00:00:59] okay [00:01:36] PROBLEM - High lag on wdqs1003 is CRITICAL: 3609 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [00:03:27] (PS3) Ladsgroup: Revert "Revert back wikidata for change_tag backend" [mediawiki-config] - https://gerrit.wikimedia.org/r/469115 [00:03:43] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/469115 (owner: Ladsgroup) [00:05:08] (CR) Dzahn: [C: 2] "join(): Requires array to work with at /etc/puppet/modules/rsync/manifests/server/module.pp:65:27 at /etc/puppet/modules/aptrepo/manifests" [puppet] - https://gerrit.wikimedia.org/r/469140 (owner: Dzahn) [00:05:45] (Merged) jenkins-bot: Revert "Revert back wikidata for change_tag backend" [mediawiki-config] - https://gerrit.wikimedia.org/r/469115 (owner: Ladsgroup) [00:07:44] (CR) Dzahn: "after merging the follow-up above that adds the parameter:" [puppet] - https://gerrit.wikimedia.org/r/465378 (owner: Muehlenhoff) [00:07:52] (CR) jenkins-bot: Revert "Revert back wikidata for change_tag backend" [mediawiki-config] - https://gerrit.wikimedia.org/r/469115 (owner: Ladsgroup) [00:08:58] (PS1) Ladsgroup: Revert "Revert "Revert back wikidata for change_tag backend"" [mediawiki-config] - https://gerrit.wikimedia.org/r/469143 [00:10:48] (CR) Ladsgroup: [C: 2] Revert "Revert "Revert back wikidata for change_tag backend"" [mediawiki-config] - https://gerrit.wikimedia.org/r/469143 (owner: Ladsgroup) [00:10:56] Canary died [00:15:44] I reverted the second patch [00:16:52] (Merged) jenkins-bot: Revert "Revert "Revert back wikidata for change_tag backend"" [mediawiki-config] - https://gerrit.wikimedia.org/r/469143 (owner: Ladsgroup) [00:17:28] !log evening SWAT is done [00:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:31]
(PS1) Dzahn: aptrepo/rsync: hosts_allow needs to be an array [puppet] - https://gerrit.wikimedia.org/r/469146 [00:17:46] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (ayounsi) > If this does go into the 'public' VLAN, could we restrict access to these nodes using some simple ferm rules?... [00:23:13] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Krenair) >>! In T207321#4687651, @ayounsi wrote: >> Where are the labsdb hosts going to live if they are being moved out... [00:24:06] (CR) jenkins-bot: Revert "Revert "Revert back wikidata for change_tag backend"" [mediawiki-config] - https://gerrit.wikimedia.org/r/469143 (owner: Ladsgroup) [00:24:27] (PS2) Dzahn: aptrepo/rsync: hosts_allow needs to be an array [puppet] - https://gerrit.wikimedia.org/r/469146 [00:24:39] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair) [00:31:17] (CR) Dzahn: "bonus: compiler says "no change" but it actually removes the Error when you look in details anyways:" [puppet] - https://gerrit.wikimedia.org/r/469146 (owner: Dzahn) [00:32:32] (CR) Dzahn: [C: 2] aptrepo/rsync: hosts_allow needs to be an array [puppet] - https://gerrit.wikimedia.org/r/469146 (owner: Dzahn) [00:32:52] (CR) Dzahn: [C: 2] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469146/" [puppet] - https://gerrit.wikimedia.org/r/469140 (owner: Dzahn) [00:33:00] (CR) Dzahn: [C: 2] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469146/" [puppet] - https://gerrit.wikimedia.org/r/467982 (owner: Muehlenhoff) [00:33:38] (PS3) Dzahn: aptrepo/rsync: hosts_allow needs
to be an array [puppet] - https://gerrit.wikimedia.org/r/469146 [00:33:46] Operations, Performance-Team, Wikidata, Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (Smalyshev) [00:34:00] Operations, Performance-Team, Wikidata, Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (Smalyshev) p:Triage>High [00:35:03] !log temp depooled wdq1003 to let it catch up [00:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:22] install2002 puppet is fixed now [00:42:26] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:43:45] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:43:46] PROBLEM - HHVM rendering on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:43:47] sites are very slow to load for me right now [00:43:55] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [00:43:55] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (wi [00:43:56] e)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random
article title) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) [00:43:56] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [00:43:56] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article [00:43:56] before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [00:43:56] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:43:58] https://pl.wikipedia.org/wiki/Nawiedzony_dom_na_wzgórzu did not load at all: [00:43:58] Request from 185.157.12.102 via cp1087 cp1087, Varnish XID 481362331 [00:43:58] Error: 503, Backend fetch failed at Tue, 23 Oct 2018 00:43:20 GMT [00:44:05] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:44:16] PROBLEM - debmonitor.wikimedia.org on debmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:44:16] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=t [00:44:16] unexpected status 500 (expecting: 200) [00:44:25] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [00:44:36] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:44:36] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:44:46] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 75386 bytes in 0.103 second response time [00:44:46] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 75389 bytes in 1.741 second response time [00:44:56] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [00:44:56] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [00:44:56] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [00:44:57] 
RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [00:45:06] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:45:15] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [00:45:15] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 219 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [00:45:16] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:45:16] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:45:16] RECOVERY - debmonitor.wikimedia.org on debmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.002 second response time [00:45:25] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [00:45:26] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [00:45:36] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [00:45:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:45:45] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [00:45:55] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:45:56] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:45:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:46:06] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [00:46:06] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [00:46:11] MatmaRex: is it ok now? [00:46:15] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:46:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:46:16] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [00:46:21] it is for me [00:46:25] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.8995 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [00:46:26] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:46:26] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [00:46:26] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:46:26] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:46:30] mutante: yeah, it loaded now [00:46:35] PROBLEM - puppet last run on kafka-jumbo1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:46:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [00:46:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:47:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:47:16] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [00:47:25] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9589 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [00:47:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:47:33] there was a short but bad spike and the rest is icinga takes some time to get 
it .. afaict [00:47:35] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.916 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [00:47:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:47:45] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [00:48:06] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:48:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [00:48:06] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:48:15] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:48:15] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:48:35] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:48:45] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [00:49:25] in all those graphs it looks like a spike that is over [00:49:25] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:49:25] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:49:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:49:26] PROBLEM - puppet last run on mw1333 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:49:43] also looking at one of those appservers [00:49:46] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:49:55] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:49:55] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:49:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:49:56] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:50:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:50:16] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:50:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:50:25] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:50:26] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:50:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:51:54] mutante, so do you have any idea what the spike was caused by? 
[00:51:55] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [00:52:05] Krenair: no [00:52:05] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [00:52:06] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [00:52:36] why are random servers failing to fetch puppet catalogs? [00:53:00] I mean, maybe not entirely random servers as each one of those could have been involved in the spike [00:53:02] they are not, mw1233 ran just fine [00:53:24] mw1333, mw1241, cp1081 all above. catalog fetch fail [00:54:16] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [00:54:51] mw1241: Notice: Applied catalog in 28.38 seconds [00:54:55] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [00:55:56] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [00:56:35] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [00:56:36] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [00:56:37] Krenair: they were probably so busy that the puppet run timed out [00:56:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [00:56:49] and these were the ones that happened to try it during the spike [00:56:49] huh ok [00:57:02] the time it runs is randomized [00:57:26] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [00:57:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [01:01:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [01:01:36] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[01:03:13] that last one seems related to the last log in SAL [01:03:35] wdqs1003 is temp depooled [01:03:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [01:07:36] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [01:13:45] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [01:14:06] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:14:56] RECOVERY - puppet last run on mw1333 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:15:26] RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:15:46] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:15:56] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:16:26] RECOVERY - puppet last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:16:55] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [01:17:06] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:17:06] RECOVERY - puppet last run on kafka-jumbo1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:23:41] (CR) Dzahn: [C: 2] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469140/" [puppet] - https://gerrit.wikimedia.org/r/467982
(owner: Muehlenhoff) [01:24:30] (CR) Dzahn: [C: 2] "after the 2 follow-ups puppet is happy now and i confirmed rsync still worked. (that is pushing from install1002 to install2002)" [puppet] - https://gerrit.wikimedia.org/r/467982 (owner: Muehlenhoff) [01:46:46] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:46] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [02:06:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:52:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:58:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:18:33] are we having an outage? i'm experiencing load.php URLs taking more than a minute to load [03:18:54] e.g.
https://sr.wikipedia.org/w/load.php?debug=false&lang=sr&modules=ext.CodeMirror.lib%7Cext.math.styles%7Cext.visualEditor.core%2Clanguage%2Cmwextensionmessages%7Coojs-ui-core%2Coojs-ui-widgets%7Coojs-ui.styles.icons-editing-advanced&skin=vector&version=1n2ohgi [03:19:01] https://commons.wikimedia.org/w/load.php?debug=false&lang=en&modules=ext.wikimediaEvents%7Cjquery%2Coojs-ui-core%2Coojs-ui-widgets%7Cmediawiki.feedback%2CmessagePoster%7Coojs-ui.styles.icons-editing-advanced%2Cicons-location&skin=vector&version=19qmpvw [03:19:20] these are from trying to load VisualEditor and UploadWizard, respectively [03:21:37] alright, they just loaded and now are fast [03:31:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:31:55] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 946.34 seconds [03:36:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:36:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:39:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:44:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:56:35] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 291.23 seconds [04:19:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:21:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [04:28:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:30:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [05:06:56] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [05:27:00] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team, User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Smalyshev) [05:34:52] (PS6) Giuseppe Lavagetto: mediawiki::web::vhost: allow serving content from php7 [puppet] - https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) [05:36:02] <_joe_> MatmaRex: I
think you were not the only one with that problem, it looks like it was an outage earlier [05:43:53] huh [05:52:35] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [06:02:09] !log Deploy schema change on s3 - T207359 [06:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:13] T207359: compress wb_changes_dispatch on testwikidatawiki - https://phabricator.wikimedia.org/T207359 [06:03:45] (PS1) KartikMistry: apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/469163 (https://phabricator.wikimedia.org/T189076) [06:04:23] (CR) jerkins-bot: [V: -1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/469163 (https://phabricator.wikimedia.org/T189076) (owner: KartikMistry) [06:18:46] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - https://gerrit.wikimedia.org/r/469164 [06:28:46] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml] [06:31:47] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/nova/policy.json] [06:32:25] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:33:25] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity] [06:38:16] PROBLEM - YARN NodeManager Node-State on analytics1068 is CRITICAL: CRITICAL: YARN NodeManager analytics1068.eqiad.wmnet:8041 Node-State: Could not find the node report for node id : analytics1068.eqiad.wmnet:8041 [06:38:54] working on it --^ [06:39:09] !log Stop replication on db1092 and db1087 for checking T206743 [06:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:12] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [06:40:35] RECOVERY - YARN NodeManager Node-State on analytics1068 is OK: OK: YARN NodeManager analytics1068.eqiad.wmnet:8041 Node-State: RUNNING [06:41:35] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:42:29] !log restart yarn and hdfs daemon on analytics1068 to pick up correct config (the host was down since before we swapped the Hadoop masters due to hw failure) [06:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:45] !log powercycle ms-be2017 (frozen since ~8hrs ago) [06:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:47] (03CR) 10Muehlenhoff: [C: 04-1] "This needs to wait until the memcached collector is removed, it's still in use on the labweb hosts." [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [06:52:55] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:53:27] ACKNOWLEDGEMENT - IPMI Sensor Status on scb2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Muehlenhoff T207629 [06:53:36] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:55:26] RECOVERY - very high load average likely xfs on ms-be2017 is OK: OK - load average: 4.00, 0.86, 0.28 [06:55:35] RECOVERY - Host ms-be2017 is UP: PING OK - Packet loss = 0%, RTA = 38.28 ms [06:56:05] RECOVERY - MD RAID on ms-be2017 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [06:57:17] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:56] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:55] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:16] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:02:50] 10Operations, 10ops-eqiad: Broken memory on thumbor1004 - https://phabricator.wikimedia.org/T207721 (10MoritzMuehlenhoff) [07:03:19] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 17 ge 4 Muehlenhoff T207721 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [07:05:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:12:21] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1100" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469164 (owner: 10Marostegui) [07:13:41] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469164 (owner: 10Marostegui) [07:14:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 - T184805 (duration: 00m 48s) [07:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:17] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:28:21] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469164 (owner: 10Marostegui) [07:30:48] PROBLEM - Apache HTTP on mw1342 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [07:31:57] RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.086 second response time [07:38:50] (03PS1) 10Urbanecm: Remove expired throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469167 (https://phabricator.wikimedia.org/T207722) [07:38:52] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) [07:39:51] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) (owner: 10Urbanecm) [07:47:03] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (10Urbanecm) Thank you @Dzahn! [07:53:09] Jonas_WMDE, do you have a second? 
[07:54:45] (03CR) 10Muehlenhoff: [C: 031] "Actually let's merge this and ignore trusty: While we do still use ISC ntpd on trusty servers as a time synchronisation client and while w" [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:00:55] (03CR) 10Zoranzoki21: [C: 04-1] "You can remove and rule for tomorrow, so this can be deployed at 24th October." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469167 (https://phabricator.wikimedia.org/T207722) (owner: 10Urbanecm) [08:01:22] (03CR) 10Zoranzoki21: [C: 04-1] "(my mistake) 24th or 25th October" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469167 (https://phabricator.wikimedia.org/T207722) (owner: 10Urbanecm) [08:02:14] (03CR) 10Zoranzoki21: [C: 04-1] "I would like to merge this and 051048e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) (owner: 10Urbanecm) [08:03:31] (03CR) 10Filippo Giunchedi: [C: 031] ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:08:30] !log update hp firmware to 6.60 on ms-be2017 - T141756 [08:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:33] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [08:09:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:12:17] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) [08:15:05] (03CR) 10Urbanecm: "I disagree. It is more clear to have those two patches splitted, and easier to review. 
Often, just forgetting something easy made throttle " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) (owner: 10Urbanecm) [08:17:27] (03CR) 10Muehlenhoff: [C: 031] "This patch is dependent on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464866/, I've commented on the patch that we should simp" [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:18:10] (03CR) 10Zoranzoki21: "-1 removed per discussion on IRC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469167 (https://phabricator.wikimedia.org/T207722) (owner: 10Urbanecm) [08:18:21] (03PS3) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) [08:21:49] (03PS4) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) [08:23:03] 10Operations, 10Discovery-Search (Current work): Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 (10Gehel) p:05Triage>03Normal [08:39:00] (03CR) 10Muehlenhoff: [C: 04-1] "But we still need to remove/absent the memcached collector until this can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:45:23] (03CR) 10Muehlenhoff: "Looks good. I'm not sure the ensure=>absent for the Nagios collector was applied yet, though? It's still listed in /etc/diamond/collectors" [puppet] - 10https://gerrit.wikimedia.org/r/468481 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:47:10] (03CR) 10Muehlenhoff: [C: 031] "Ah, that's only done in a not-yet-merged patch I initially missed, so fine after all." 
[puppet] - 10https://gerrit.wikimedia.org/r/468481 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:48:37] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.change_tag: Cant find record in change_tag, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003121, end_log_pos 1035549760 [08:48:47] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2096.14 seconds [08:49:50] (03CR) 10Muehlenhoff: "The comments touch a number of discussions which are not covered in the commit message, the patch looks fine from a technical angle but it" [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:51:04] that ^ is expected [08:51:07] jynus: ^ [08:51:47] are you asking or saying? [08:53:48] I am asking if that's you, as I thought you were working with change_tag? [08:54:19] then it is clearer if you say "Is that ^ expected?" [08:54:31] I am doing 20 things at the same time, sorry. [08:54:52] I will look at it [08:54:55] thanks [08:59:31] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T207713 (10fgiunchedi) 05Open>03Invalid Looks like a case of the controller freaking out. I've updated its firmware now to 6.60, after a reboot the raid is clean ``` ms-be2017:~$ cat /proc/mdstat Personalities : [li... 
[09:02:41] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: use statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467988 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:04:12] (03PS2) 10Filippo Giunchedi: thumbor: use statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467988 (https://phabricator.wikimedia.org/T205870) [09:06:49] (03CR) 10Zfilipin: New throttle rule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) (owner: 10Urbanecm) [09:08:05] (03CR) 10Effie Mouzeli: [C: 032] admin: add aaron(Aaron Schulz) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/468989 (https://phabricator.wikimedia.org/T207090) (owner: 10Mathew.onipe) [09:08:41] (03PS3) 10Effie Mouzeli: admin: add aaron(Aaron Schulz) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/468989 (https://phabricator.wikimedia.org/T207090) (owner: 10Mathew.onipe) [09:11:35] (03CR) 10Muehlenhoff: [C: 031] "Looks great! 
One nit: Maybe rename "KB Total" to "Apache throughput per seconds (KB)", otherwise it's not really obvious what this value m" [puppet] - 10https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [09:13:00] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10jijiki) 05Open>03Resolved [09:13:41] !log roll-restart thumbor to send statsd traffic through statsd_exporter - T205870 [09:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:45] T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 [09:14:04] (03CR) 10Mathew.onipe: [C: 031] admin: Add liw user account [puppet] - 10https://gerrit.wikimedia.org/r/467938 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [09:16:18] (03PS1) 10Elukey: eventlogging: use /srv/log instead of /var/log as default logging dir [puppet] - 10https://gerrit.wikimedia.org/r/469176 [09:16:30] (03CR) 10Effie Mouzeli: [C: 031] admin: Add liw user account [puppet] - 10https://gerrit.wikimedia.org/r/467938 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [09:16:54] (03CR) 10Alex Monk: "The shinken thing was removed in I0c018646." [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [09:19:11] (03CR) 10Muehlenhoff: [C: 031] admin: Add liw user account [puppet] - 10https://gerrit.wikimedia.org/r/467938 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [09:19:54] (03CR) 10Muehlenhoff: [C: 031] "This was approved in yesterday's meeting." 
[puppet] - 10https://gerrit.wikimedia.org/r/467939 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [09:22:55] (03CR) 10Effie Mouzeli: [C: 032] admins: add kharlan to 'restricted' group [puppet] - 10https://gerrit.wikimedia.org/r/468327 (https://phabricator.wikimedia.org/T207330) (owner: 10Dzahn) [09:23:21] (03PS2) 10Effie Mouzeli: admins: add kharlan to 'restricted' group [puppet] - 10https://gerrit.wikimedia.org/r/468327 (https://phabricator.wikimedia.org/T207330) (owner: 10Dzahn) [09:28:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10jijiki) 05Open>03Resolved >>! In T207330#4677366, @Krenair wrote: > btw, I'm still not convinced about the rules regarding sudo r... [09:30:27] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:30:58] RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:31:41] !log depooling / banning elastics1017 and 1022 - T207724 [09:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:45] T207724: Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 [09:33:58] (03CR) 10Ema: [C: 031] "Two minor comments, looks good to my untrained eye otherwise." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) (owner: 10Giuseppe Lavagetto) [09:34:19] (03PS1) 10Elukey: eventbus: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/469178 [09:38:01] (03PS2) 10Ema: icinga: remove unused check_http commands [puppet] - 10https://gerrit.wikimedia.org/r/468961 [09:38:24] (03CR) 10Effie Mouzeli: [C: 032] admin: Add liw user account [puppet] - 10https://gerrit.wikimedia.org/r/467938 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [09:38:54] (03CR) 10Ema: [C: 032] icinga: remove unused check_http commands [puppet] - 10https://gerrit.wikimedia.org/r/468961 (owner: 10Ema) [09:39:29] (03CR) 10Effie Mouzeli: [C: 032] "Approved in SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/467938 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [09:39:49] (03PS2) 10Effie Mouzeli: admin: Add liw user account [puppet] - 10https://gerrit.wikimedia.org/r/467938 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [09:40:17] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [09:42:09] !log stopping db1087 to fix db1124 [09:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:31] (03PS5) 10Urbanecm: New throttle rule for Wikipedia in Ort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) [09:48:32] (03PS2) 10Elukey: eventbus: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/469178 [09:50:08] (03PS1) 10Filippo Giunchedi: thumbor: add missing statsd_exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/469179 (https://phabricator.wikimedia.org/T205870) [09:50:34] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: add missing statsd_exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/469179 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:53:04] 
PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.53 seconds [09:54:21] (03PS1) 10Zoranzoki21: Enable RCPatrol for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469180 (https://phabricator.wikimedia.org/T207732) [09:55:05] (03CR) 10Effie Mouzeli: [C: 032] "Approved in SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/467939 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [10:02:50] (03PS3) 10Elukey: eventbus: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/469178 [10:03:13] (03PS1) 10Filippo Giunchedi: prometheus: add jobs for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469182 (https://phabricator.wikimedia.org/T205870) [10:04:06] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add jobs for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469182 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [10:06:09] (03PS4) 10Elukey: eventbus: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/469178 [10:08:04] (03CR) 10Elukey: "Cumulative list of changes (including the parent change):" [puppet] - 10https://gerrit.wikimedia.org/r/469178 (owner: 10Elukey) [10:08:34] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:13:31] !log upload libc++ 6.0.1 to stretch-wikimedia/main T204232 [10:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:35] T204232: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 [10:22:02] I'm looking at the hp raid service checks timing out btw [10:22:47] (03PS2) 10Ema: Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) [10:24:13] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:34] PROBLEM - MediaWiki memcached error rate on 
graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:27:47] godog: those are a little odd, I tried to re-run a few manually via "Schedule next service check" yesterday, but that didn't resolve it for all of them [10:29:43] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.580 second response time [10:30:03] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:30:03] the memcached error should resolve soon [10:30:09] there you go [10:30:31] historically they timeout when there is too much I/O activity on the disks [10:31:12] (the raid ones) [10:36:04] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:37:13] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:37:28] (03CR) 10jerkins-bot: [V: 04-1] Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [10:38:04] again WANCache:t:commonswiki:gadgets-definition, I really hope that the new change that AaronSchulz made will make some difference (should be deployed this week) [10:47:19] (03CR) 10Volans: "LGTM, but one nitpick inline for the Hash type." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [10:49:53] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:53:42] (03Abandoned) 10Effie Mouzeli: admin: Add liw to deployment, contint-admins, labnet-users and contint-docker [puppet] - 10https://gerrit.wikimedia.org/r/467939 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [10:57:32] (03PS1) 10Effie Mouzeli: admin: Add liw to deployment, contint-admins, labnet-users and contint-docker [puppet] - 10https://gerrit.wikimedia.org/r/469187 (https://phabricator.wikimedia.org/T206612) [10:59:40] jouncebot: now [10:59:40] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [11:00:04] jouncebot: next [11:00:04] In 0 hour(s) and 59 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1200) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1100). Please do the needful. [11:00:04] Jonas_WMDE, Urbanecm, and Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:20] I am here ;) [11:00:27] Hi [11:01:25] o/ [11:01:30] I can SWAT today [11:01:42] zeljkof: You need to deploy my first. I have to be for 10 minutes in school [11:03:12] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10MoritzMuehlenhoff) @crusnov I've added you to pwstore. You can find some docs at https://office.wikimedia.org/wiki/Pwstore, let me know if you run into any issues. 
[11:03:17] Zoranzoki21: no problem, you're first then [11:03:22] ^ volans [11:03:23] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10MoritzMuehlenhoff) [11:03:32] ok [11:03:59] moritzm: <3 [11:04:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469180 (https://phabricator.wikimedia.org/T207732) (owner: 10Zoranzoki21) [11:05:15] ... [11:05:31] (test because I changed some setting in wifi) [11:05:36] ok.. everything is ok [11:05:46] (03CR) 10Muehlenhoff: [C: 031] "Was acked in the SRE meeting." [puppet] - 10https://gerrit.wikimedia.org/r/469187 (https://phabricator.wikimedia.org/T206612) (owner: 10Effie Mouzeli) [11:05:57] (03Merged) 10jenkins-bot: Enable RCPatrol for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469180 (https://phabricator.wikimedia.org/T207732) (owner: 10Zoranzoki21) [11:06:47] zeljkof: mwdebug1002? [11:07:03] Zoranzoki21: just push it there, please test [11:07:31] zeljkof: sure [11:10:04] zeljkof: Works, push to production [11:10:10] ok [11:11:03] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:469180|Enable RCPatrol for srwikiquote (T207732)]] (duration: 00m 47s) [11:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:07] T207732: Enable RC Patrol for srwikiquote - https://phabricator.wikimedia.org/T207732 [11:11:17] Zoranzoki21: it's deployed, please test and run to school! :D [11:11:33] Ok ;) [11:11:36] Works ;) [11:11:39] Jonas_WMDE: around for SWAT? [11:11:43] Zoranzoki21: great! [11:12:04] Teacher is happy when I tell him that I've done something for Wikimedia [11:12:13] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.785 second response time [11:12:13] So, no problem if I'm a little late [11:12:16] CYA [11:12:18] <3 [11:13:10] (03CR) 10Zfilipin: "This is scheduled for EU SWAT (right now), but it's already merged." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [11:13:52] Jonas_WMDE, addshore this is scheduled for swat, but it's already merged? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/463439 [11:13:56] * zeljkof is confused [11:14:18] Urbanecm: you're next [11:14:24] (03CR) 10Mathew.onipe: [C: 031] admin: Add liw to deployment, contint-admins, labnet-users and contint-docker [puppet] - 10https://gerrit.wikimedia.org/r/469187 (https://phabricator.wikimedia.org/T206612) (owner: 10Effie Mouzeli) [11:14:24] ack [11:14:36] but looks like there is nothing to test, so I'll just let you know when I'm done [11:14:40] kk [11:15:04] (03CR) 10Effie Mouzeli: [C: 032] admin: Add liw to deployment, contint-admins, labnet-users and contint-docker [puppet] - 10https://gerrit.wikimedia.org/r/469187 (https://phabricator.wikimedia.org/T206612) (owner: 10Effie Mouzeli) [11:15:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469167 (https://phabricator.wikimedia.org/T207722) (owner: 10Urbanecm) [11:15:34] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:32] (03Merged) 10jenkins-bot: Remove expired throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469167 (https://phabricator.wikimedia.org/T207722) (owner: 10Urbanecm) [11:18:11] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) (owner: 10Urbanecm) [11:19:22] (03Merged) 10jenkins-bot: New throttle rule for Wikipedia in Ort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) (owner: 10Urbanecm) [11:20:10] (03CR) 10jenkins-bot: Enable RCPatrol for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469180 (https://phabricator.wikimedia.org/T207732) (owner: 10Zoranzoki21) [11:20:12] 
(03CR) 10jenkins-bot: Remove expired throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469167 (https://phabricator.wikimedia.org/T207722) (owner: 10Urbanecm) [11:20:14] (03CR) 10jenkins-bot: New throttle rule for Wikipedia in Ort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469168 (https://phabricator.wikimedia.org/T207714) (owner: 10Urbanecm) [11:20:54] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:469168|New throttle rule for Wikipedia in Ort (T207714)]] (duration: 00m 46s) [11:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:57] T207714: Account creation exception for wikimedia event in cologne - https://phabricator.wikimedia.org/T207714 [11:21:04] Urbanecm: all done! [11:21:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10jijiki) 05Open>03Resolved [11:22:39] thank you zeljkof [11:23:01] !log EU SWAT finished [11:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:07] this was a quick swat :) [11:23:39] yep [11:28:23] 10Operations, 10Security: Access requests process: Consideration of 'indirect' sudo rules via e.g. 
keyholder - https://phabricator.wikimedia.org/T207739 (10Krenair) [11:36:26] (03PS3) 10Ema: Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) [11:40:43] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:42:56] 10Operations, 10Wikimedia-Mailing-lists: New list request for 1lib1ref - https://phabricator.wikimedia.org/T207283 (10jijiki) a:03jijiki [11:46:34] PROBLEM - Apache HTTP on mw2210 is CRITICAL: HTTP CRITICAL - No data received from host [11:47:43] RECOVERY - Apache HTTP on mw2210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time [11:48:06] sorry this is me --^ [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1200) [12:07:23] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [12:12:05] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) p:05Triage>03High [12:19:08] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 4 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 (10jijiki) p:05Triage>03Normal [12:21:39] (03CR) 10Filippo Giunchedi: [C: 031] aptrepo: add thirdparty/confluent component for jessie [puppet] - 10https://gerrit.wikimedia.org/r/469123 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [12:26:29] (03CR) 10Filippo Giunchedi: "Seeing this and I21d43114ed makes me think the simplest option would be to import kafka 1.1 to jessie-wikimedia/thirdparty and leave third" [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [12:26:48] (03CR) 10Muehlenhoff: 
[C: 04-1] "That's not needed, on jessie the confluent kafka packages are present in "thirdparty" (as the split into separate thirdparty components wa" [puppet] - 10https://gerrit.wikimedia.org/r/469123 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [12:27:17] indeed, thanks for confirming moritzm [12:28:24] 10Operations, 10Revision-Slider, 10TCB-Team, 10WMDE-Analytics-Engineering, 10Graphite: Fix aggregation of "MediaWiki.RevisionSlider.event.load.sum" from average to sum - https://phabricator.wikimedia.org/T205416 (10jijiki) p:05Triage>03Low [12:28:59] thank wikibugs bot, otherwise I'd never have noticed :-) [12:29:40] !log depooling / banning elastics1028 and 1030 - T207724 [12:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:44] T207724: Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 [12:32:18] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) @elukey analytics1068 is back up and running. Please resolve this task if everything looks good to you. [12:35:54] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) >>! In T207663#4687081, @faidon wrote: > It's sad to hear that's a major disruption :( Would it make sense to do this now when it's early in the migratio... [12:36:56] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10elukey) @Cmjohnson thanks a lot! I noticed it this morning and fixed it, it was running with a old/stale config (and failing, so no big deal). Can we sync (either with me or Andrew) next time bef... 
[12:39:54] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.624 second response time [12:43:23] (03CR) 10Filippo Giunchedi: mediawiki::web::vhost: allow serving content from php7 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) (owner: 10Giuseppe Lavagetto) [12:43:24] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:46:30] (03CR) 10Elukey: [C: 031] "IIUC testing has been proven working fine, my doubts were cleared, seems good :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) (owner: 10Giuseppe Lavagetto) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1300) [13:01:02] (03CR) 10Filippo Giunchedi: mediawiki::web::vhost: allow serving content from php7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) (owner: 10Giuseppe Lavagetto) [13:07:13] (03PS2) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) [13:08:48] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:09:51] (03CR) 10Mathew.onipe: "> Patch Set 2: Verified-1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:09:58] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (10aborrero) I don't think SMTP and DNS are in the same level, so perhaps is not a fair comparison. 
No VM needs SMTP to work, a failure in SMTP servers is not a disaster. Most stuff running... [13:10:03] (03CR) 10Ema: [C: 032] Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [13:11:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Ok, great, then it sounds like this should go in the public VLAN, with ACLs in the Analytics VLAN to allow us t... [13:14:36] (03CR) 10Herron: "That would work for the purpose of logstash, but I think will present other existing jessie kafka hosts with available confluent- package " [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:22:37] !log depooling / banning elastics1018 - T207724 [13:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:41] T207724: Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 [13:25:38] (03PS1) 10Gehel: wdqs: Assign user directory to blazegraph user [puppet] - 10https://gerrit.wikimedia.org/r/469194 [13:27:10] (03CR) 10Gehel: [C: 032] wdqs: Assign user directory to blazegraph user [puppet] - 10https://gerrit.wikimedia.org/r/469194 (owner: 10Gehel) [13:28:11] 10Operations, 10Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) My understanding of the problem is: * cloud supporting services in hardware (DNS, SMTP, DB replicas, NFS, monitoring, etc) share their addr... 
[13:28:53] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) [13:29:15] zeljkof: I'm in Portland this week so different timezone, I might have added that my accident to the wrong bit of the page, yes it is already deployed, nothing to do there [13:29:18] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) a:03aborrero [13:32:40] addshore: cool [13:32:55] (03CR) 10Ottomata: [C: 031] eventlogging: use /srv/log instead of /var/log as default logging dir [puppet] - 10https://gerrit.wikimedia.org/r/469176 (owner: 10Elukey) [13:34:13] (03CR) 10Ottomata: [C: 031] "oh! great!" [puppet] - 10https://gerrit.wikimedia.org/r/469178 (owner: 10Elukey) [13:35:53] !log rolling restart of blazegraph for change to blazegraph home dir [13:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:34] (03CR) 10Ottomata: "In this case Im' not worried about it. The only (other) remaining prod kafka jessie hosts are the old analytics servers. They are are out" [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:37:34] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.268 second response time [13:38:51] 10Operations, 10Revision-Slider, 10TCB-Team, 10WMDE-Analytics-Engineering, 10Graphite: Fix aggregation of "MediaWiki.RevisionSlider.event.load.sum" from average to sum - https://phabricator.wikimedia.org/T205416 (10Lea_WMDE) Since this task was marked as low: The fact that metrics are aggregated by avera... 
[13:39:11] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:40:21] (03PS1) 10Bstorm: sonofgridengine: correct the collector variables [puppet] - 10https://gerrit.wikimedia.org/r/469196 (https://phabricator.wikimedia.org/T200557) [13:41:03] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:02] (03CR) 10Bstorm: [C: 032] sonofgridengine: correct the collector variables [puppet] - 10https://gerrit.wikimedia.org/r/469196 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [13:43:51] !log depooling / banning elastics1029 - T207724 [13:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:55] T207724: Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 [13:47:20] (03CR) 10Ottomata: "ah right, great!" [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:48:59] !log depooling / banning elastics1031 - T207724 [13:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:03] T207724: Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 [13:52:59] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 3 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10bmansurov) @jcrespo can you please help out with T205452#4674282? Thanks! 
[13:58:00] (03PS2) 10Elukey: eventlogging: use /srv/log instead of /var/log as default logging dir [puppet] - 10https://gerrit.wikimedia.org/r/469176 [14:00:31] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13158/" [puppet] - 10https://gerrit.wikimedia.org/r/469176 (owner: 10Elukey) [14:00:48] !log upload trafficserver 8.0.0-1wm1 to stretch-wikimedia/main T204232 [14:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:53] T204232: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 [14:01:42] !log installing spice security updates [14:01:43] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster/multi-instance support (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [14:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:32] !log repooling / banning elastics1031 - T207724 [14:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:35] T207724: Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 [14:04:28] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 3 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10jcrespo) Sorry, there is an ops clinic duty to answer these kind of requests- I did my part which was creating the user accoun... 
[14:05:35] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster/multi-instance support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [14:06:52] 10Operations, 10monitoring: Graphite1001 disk usage at 96% - https://phabricator.wikimedia.org/T207040 (10jijiki) p:05Triage>03Normal a:03fgiunchedi [14:08:17] (03PS1) 10Filippo Giunchedi: prometheus: set defaults for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469200 (https://phabricator.wikimedia.org/T205870) [14:08:56] (03CR) 10jerkins-bot: [V: 04-1] prometheus: set defaults for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469200 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:09:10] (03PS5) 10Elukey: eventbus: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/469178 [14:09:15] 10Operations, 10monitoring: Adapt Kafka dashboards to use metrics from prometheus-node-exporter - https://phabricator.wikimedia.org/T207041 (10MoritzMuehlenhoff) p:05Triage>03Normal [14:09:39] 10Operations, 10netops: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (10ayounsi) p:05Triage>03Normal [14:10:05] (03CR) 10Elukey: [C: 032] eventbus: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/469178 (owner: 10Elukey) [14:12:32] (03PS2) 10Filippo Giunchedi: prometheus: set defaults for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469200 (https://phabricator.wikimedia.org/T205870) [14:12:37] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10bd808) > The new Elastic replicas (which haven't been set up yet, see T194186) I suppose this could be done by trea... 
[14:13:16] (03CR) 10jerkins-bot: [V: 04-1] prometheus: set defaults for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469200 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:14:14] jouncebot: now [14:14:14] For the next 0 hour(s) and 45 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1300) [14:14:34] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Krenair) >>! In T207536#4688833, @bd808 wrote: > I'm sure there are cons to consider as well both for the WMCS team... [14:16:30] (03PS3) 10Filippo Giunchedi: prometheus: set defaults for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469200 (https://phabricator.wikimedia.org/T205870) [14:16:56] 10Operations, 10Security: Access requests process: Consideration of 'indirect' sudo rules via e.g. keyholder - https://phabricator.wikimedia.org/T207739 (10Krenair) * Actually, should any group providing access to a shared SSH key (keyholder or otherwise) to prod hosts need meeting review, even if the target u... [14:18:13] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Andrew) Spoke too soon, got another failure overnight. ``` Oct 23 06:25:20 labvirt1017 puppet-agent[161569]: (/Stage[main]/Openstack::Nova::Common::Base/File[/etc/nova/policy.json]) Could not evaluate... 
[14:18:48] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: set defaults for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/469200 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:20:56] 10Operations, 10cloud-services-team: Create a jessie netboot image with the 4.9 Linux kernel - https://phabricator.wikimedia.org/T206761 (10MoritzMuehlenhoff) p:05Triage>03Lowest [14:21:21] 10Operations, 10ops-eqiad: Broken memory on thumbor1004 - https://phabricator.wikimedia.org/T207721 (10MoritzMuehlenhoff) p:05Triage>03Normal [14:22:32] !log anomie@deploy1001 Synchronized php-1.32.0-wmf.26/includes/filerepo/file/LocalFile.php: Backport for T207419 (duration: 00m 47s) [14:22:34] 10Operations, 10ops-codfw: relabel server saiph.frack.codfw.wmnet to frpig2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T207036 (10jijiki) p:05Triage>03Normal [14:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:36] T207419: sql error: Error: 1048 Column 'fa_description_id' cannot be null - https://phabricator.wikimedia.org/T207419 [14:23:31] <_joe_> elukey: run! [14:24:26] what did I do!? [14:24:28] (03CR) 10C. Scott Ananian: [C: 031] Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: 10Fomafix) [14:24:48] 10Operations, 10ops-codfw: relabel server saiph.frack.codfw.wmnet to frpig2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T207036 (10jijiki) a:03Papaul [14:24:52] <_joe_> you? 
nothing [14:24:53] <_joe_> I did [14:24:54] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: use handler to do proxying [puppet] - 10https://gerrit.wikimedia.org/r/469201 [14:24:56] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving requests with php7 [puppet] - 10https://gerrit.wikimedia.org/r/469202 [14:24:59] (03PS1) 10Giuseppe Lavagetto: beta: start using set_handler instead of the proxy passes [puppet] - 10https://gerrit.wikimedia.org/r/469203 [14:25:03] <_joe_> elukey: ^^ [14:25:47] wow [14:26:02] <_joe_> I wasn't happy with the direction we were going to [14:26:21] <_joe_> this is *way* better IMHO [14:27:01] ahhh nice! [14:27:21] 10Operations: Access requests process: People sometimes specify hostnames instead of admin groups in access requests - https://phabricator.wikimedia.org/T207754 (10Krenair) [14:27:23] <_joe_> I might avoid creating changes in general until one activates the set_handler variable, but the changes should be super-safe imho [14:28:02] <_joe_> yeah, gonna do that, but it's a minor change in what I wrote [14:28:29] <_joe_> this way we can apply the changes only to mwdebug first, and take our time to offer php7 elsewhere [14:31:40] (03PS1) 10Anomie: Set CommentTableSchemaMigrationStage => WRITE_NEW on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469204 (https://phabricator.wikimedia.org/T166733) [14:32:05] (03CR) 10Anomie: [C: 032] "Deploying planned config change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469204 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:32:30] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster/multi-instance support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [14:33:14] (03Merged) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469204 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:33:54] (03PS1) 10Addshore: CS.php, check UseWikibaseMediaInfo for loading Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469205 [14:34:13] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting comment table migration stage to write-new/read-both on group 1 (T166733) (duration: 00m 46s) [14:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:21] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [14:35:00] 10Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T206909 (10jijiki) p:05Triage>03High a:03Papaul [14:37:14] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.592 second response time [14:40:05] (03CR) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469204 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:40:17] (03PS1) 10Addshore: BETA: Set wmgWikibaseCachePrefix for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469206 [14:40:44] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:47] 10Operations, 10Discovery-Search (Current work): Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 (10Gehel) 
With 6 servers depooled / banned, the cluster seems to be just fine. Starting at 7 nodes depooled, I see the load rising on some of the othe... [14:41:01] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10bd808) >>! In T207321#4687656, @Krenair wrote: >>>! In T207321#4687651, @ayounsi wrote: >>> Where are the labsdb hosts go... [14:41:51] 10Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T206909 (10jijiki) ``` [Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: scanning for scsi7... [Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: 6833 (592740763s/0x0001/CRIT) - VD 00/0 is now DEGRADED [Sat Oct 13 09:54... [14:42:29] 10Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T164955 (10jijiki) 05Open>03Resolved a:03jijiki T206909 was opened automatically. [14:45:27] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10jijiki) p:05Triage>03Normal a:03Cmjohnson [14:46:12] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10elukey) @Cmjohnson sure lemme know when it works best for you, it should take me ~5/10 mins to shut it down properly (worst case scenario) [14:46:39] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10jijiki) [14:47:50] !log added confluent-kafka-2.11 1.1.0-1 package to jessie-wikimedia/thirdparty T206454 [14:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:53] T206454: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 [14:48:09] 10Operations, 10Maps, 10Patch-For-Review: Switch to unix socket connections for osmupdater / osmimporter for postgresql on maps - https://phabricator.wikimedia.org/T206639 
(10jijiki) p:05Triage>03Normal a:03Mathew.onipe [14:48:32] (03CR) 10Herron: "> AFAICS it shouldn't, meaning that confluent-kafka-2.11.7 (and" [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [14:50:00] mobrovac: FYI in case you're not aware of the flapping of pdfrender on scb1004, 6 times today so far [14:50:02] (03Abandoned) 10Herron: confluent::kafka::common support thirdparty/confluent on jessie [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [14:50:03] according to icinga [14:50:20] 10Operations, 10DNS, 10Traffic: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (10jijiki) p:05Triage>03Normal [14:51:06] (03Abandoned) 10Herron: aptrepo: add thirdparty/confluent component for jessie [puppet] - 10https://gerrit.wikimedia.org/r/469123 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [14:51:48] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, and 2 others: Remove *.cz domains from WMF's infrastructure - https://phabricator.wikimedia.org/T206923 (10jijiki) p:05Triage>03Low [14:52:49] (03PS1) 10Addshore: Wikibase.php, don't load wikidata repo settings on other repos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 [14:52:56] (03Abandoned) 10Herron: logstash: set logging kafka package version to 1.1.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/469124 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [14:54:33] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Remove *.cz domains from WMF's infrastructure - https://phabricator.wikimedia.org/T206923 (10jijiki) a:03jijiki [14:55:47] jouncebot: next [14:55:48] In 1 hour(s) and 4 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1600) [14:57:48] 10Operations, 10monitoring, 10Availability, 10Patch-For-Review, 
10Performance-Team (Radar): Perform a statsd and Graphite switch - https://phabricator.wikimedia.org/T206963 (10jijiki) p:05Triage>03Normal [15:02:53] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move labmon (Graphite, StatsD) into a Cloud VPS - https://phabricator.wikimedia.org/T207543 (10Krenair) >>! In T207543#4684881, @aborrero wrote: > We have dedicated hardware for this, in the case of labmon1002, a fairly new server which was put into... [15:07:05] Hallo [15:07:31] These days, what's the right server for making simple queries from sql enwiki, sql fawiki, sql wikishared? [15:07:40] In the past it used to be terbium. Then mwmaint1001. [15:07:51] Now, if I'm not mistaken, it's mwmaint1002... [15:07:55] am I wrong? [15:08:24] 10Operations, 10monitoring, 10Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (10jijiki) p:05Triage>03Normal [15:08:44] (03PS1) 10Bstorm: sonofgridengine: move the filetype definition outside the provider def [puppet] - 10https://gerrit.wikimedia.org/r/469210 (https://phabricator.wikimedia.org/T200557) [15:09:19] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: move the filetype definition outside the provider def [puppet] - 10https://gerrit.wikimedia.org/r/469210 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:09:29] aharoni, mwmaint1002 should work I think [15:10:29] 10Operations, 10cloud-services-team: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10MoritzMuehlenhoff) p:05Triage>03High [15:11:33] (03PS2) 10Bstorm: sonofgridengine: move the filetype definition outside the provider def [puppet] - 10https://gerrit.wikimedia.org/r/469210 (https://phabricator.wikimedia.org/T200557) [15:11:39] (03CR) 10Filippo Giunchedi: mediawiki::web::vhost: use handler to do proxying (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469201 (owner: 10Giuseppe Lavagetto) [15:11:57] (03CR) 10Cwhite: 
[C: 031] icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [15:12:41] (03CR) 10Cwhite: [C: 031] icinga/nsca: use systemd::service, unit file by systemd-sysv-generator [puppet] - 10https://gerrit.wikimedia.org/r/469130 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [15:18:43] (03CR) 10Giuseppe Lavagetto: [C: 031] sonofgridengine: move the filetype definition outside the provider def [puppet] - 10https://gerrit.wikimedia.org/r/469210 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:22:27] (03CR) 10Cwhite: "> The comments touch a number of discussions which are not covered in" [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:22:50] <_joe_> aharoni: you're not wrong [15:23:07] <_joe_> if you need to do the queries in production, that is [15:23:23] (03CR) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [15:27:05] (03PS2) 10Cwhite: hiera: remove diamond from mediawiki role [puppet] - 10https://gerrit.wikimedia.org/r/466905 (https://phabricator.wikimedia.org/T183454) [15:27:23] (03CR) 10Jforrester: [C: 031] CS.php, check UseWikibaseMediaInfo for loading Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469205 (owner: 10Addshore) [15:27:35] (03CR) 10Jforrester: [C: 031] BETA: Set wmgWikibaseCachePrefix for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469206 (owner: 10Addshore) [15:29:40] (03PS3) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) [15:34:00] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Sanitizing input and increase throttling 
rate for wdqs errors to prevent spamming logstash - https://phabricator.wikimedia.org/T207643 (10jijiki) p:05Triage>03Normal a:03Gehel [15:35:31] shdubsh, "The situation is still unfolding and we won't be able to determine the safety of this change for another week." ? [15:42:03] Krenair: We've removed the obvious user, it's still unknown whether or not the metrics are being used by something else. [15:42:35] ok [15:42:46] do we have no way of monitoring for that? [15:42:56] i.e. whether any statistics are going in at all? [15:43:23] 10Operations, 10MediaWiki-Page-deletion, 10Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (10jijiki) p:05Triage>03Normal [15:43:29] Statistics are definitely going in. The plan is to check logs in a few days to see if anything is going out. [15:46:01] ok... [15:46:12] where are they going in from if that got removed? [15:46:17] oh wait [15:46:26] we removed the shinken check, not the nagios collector itself? [15:46:32] Right [15:46:38] sorry yes, you're right [15:47:30] Thanks for following up! :) [15:47:36] 10Operations, 10Analytics: setup/install barium/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) p:05Triage>03Normal [15:47:37] (03CR) 10Jforrester: Wikibase.php, don't load wikidata repo settings on other repos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [15:48:06] (03CR) 10Cwhite: [C: 032] hiera: remove diamond from mediawiki role [puppet] - 10https://gerrit.wikimedia.org/r/466905 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:48:26] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264 (10RobH) 05Open>03Resolved Created sub-task T207760 for setup. 
[15:50:29] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster/multi-instance support (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [15:54:19] (03PS1) 10Niedzielski: Update: add Wikimedia logo for SEO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) [15:54:21] (03CR) 10Cwhite: "> Looks great! One nit: Maybe rename "KB Total" to "Apache throughput" [puppet] - 10https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:58:47] (03CR) 10Jforrester: Wikibase.php, don't load wikidata repo settings on other repos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [15:59:11] (03CR) 10Jforrester: Wikibase.php, don't load wikidata repo settings on other repos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [16:00:05] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:04:23] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.491 second response time [16:04:28] 10Operations, 10Cloud-Services, 10Traffic: Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (10ema) [16:07:03] 10Operations, 10ops-eqiad: apply hostname label for barium/WMF4750 - https://phabricator.wikimedia.org/T207764 (10RobH) p:05Triage>03Normal [16:07:44] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:44] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.306 second response time [16:09:19] 10Operations, 10Cloud-Services, 10Traffic, 10Beta-Cluster-reproducible: Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (10Krenair) [16:10:08] joincebot: next [16:10:11] lol [16:10:14] jouncebot: next [16:10:14] In 0 hour(s) and 49 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1700) [16:10:55] (03CR) 10Giuseppe Lavagetto: mediawiki::web::vhost: use handler to do proxying (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469201 (owner: 10Giuseppe Lavagetto) [16:11:47] 10Operations, 10Cloud-Services, 10Traffic, 10Beta-Cluster-reproducible: Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (10Krenair) Was traffic-ats-stretch using 172.16.2.180 when this broke? Is it possible you got migrated across regions... 
[16:12:14] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:47] (03PS1) 10RobH: set barium dns entries [dns] - 10https://gerrit.wikimedia.org/r/469217 (https://phabricator.wikimedia.org/T207760) [16:15:24] (03CR) 10RobH: [C: 032] set barium dns entries [dns] - 10https://gerrit.wikimedia.org/r/469217 (https://phabricator.wikimedia.org/T207760) (owner: 10RobH) [16:17:40] (03CR) 10Addshore: Wikibase.php, don't load wikidata repo settings on other repos (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [16:17:55] (03CR) 10Gilles: [C: 031] Update: add Wikimedia logo for SEO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) (owner: 10Niedzielski) [16:19:03] (03PS2) 10Addshore: Wikibase.php, don't load wikidata repo settings on other repos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 [16:20:02] !log restarted pdfrender on scb1004 [16:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:17] (03CR) 10Bstorm: [C: 032] sonofgridengine: move the filetype definition outside the provider def [puppet] - 10https://gerrit.wikimedia.org/r/469210 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [16:20:19] (03PS3) 10Bstorm: sonofgridengine: move the filetype definition outside the provider def [puppet] - 10https://gerrit.wikimedia.org/r/469210 (https://phabricator.wikimedia.org/T200557) [16:20:51] 10Operations, 10Cloud-Services, 10Traffic, 10Beta-Cluster-reproducible, 10cloud-services-team (Kanban): Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (10aborrero) [16:21:04] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time [16:21:43] 10Operations, 10Cloud-Services, 10Traffic, 10Beta-Cluster-reproducible, 10cloud-services-team (Kanban): Traffic project in labs 
cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (10Krenair) 05Open>03Resolved a:05aborrero>03Krenair ema, this was... [16:26:07] (03PS1) 10Andrew Bogott: Horizon: enable 'search' VM creation in eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/469219 (https://phabricator.wikimedia.org/T207715) [16:26:30] (03PS2) 10Andrew Bogott: Horizon: enable 'search' VM creation in eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/469219 (https://phabricator.wikimedia.org/T207715) [16:28:01] (03CR) 10Andrew Bogott: [C: 032] Horizon: enable 'search' VM creation in eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/469219 (https://phabricator.wikimedia.org/T207715) (owner: 10Andrew Bogott) [16:28:50] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) [16:29:06] 10Operations, 10ops-eqiad: apply hostname label for weblog1001/WMF4750 - https://phabricator.wikimedia.org/T207764 (10RobH) [16:33:51] (03PS1) 10RobH: weblog1001 setup to replace oxygen [dns] - 10https://gerrit.wikimedia.org/r/469221 (https://phabricator.wikimedia.org/T207760) [16:34:16] (03CR) 10RobH: [C: 032] weblog1001 setup to replace oxygen [dns] - 10https://gerrit.wikimedia.org/r/469221 (https://phabricator.wikimedia.org/T207760) (owner: 10RobH) [16:37:08] (03PS1) 10Volans: tests: remove pylint skip-file [software/cumin] - 10https://gerrit.wikimedia.org/r/469222 [16:46:29] (03CR) 10CRusnov: [V: 031 C: 031] "Looks good more linting is good. Tox passes so +!" 
[software/cumin] - 10https://gerrit.wikimedia.org/r/469222 (owner: 10Volans) [16:54:07] (03PS1) 10RobH: weblog1001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/469224 (https://phabricator.wikimedia.org/T207760) [16:56:15] (03CR) 10RobH: [C: 032] weblog1001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/469224 (https://phabricator.wikimedia.org/T207760) (owner: 10RobH) [16:56:25] (03PS2) 10RobH: weblog1001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/469224 (https://phabricator.wikimedia.org/T207760) [16:57:56] robh: typo! [16:58:05] yeah [16:58:10] in commit message at least ;D [16:58:47] (03CR) 10WMDE-leszek: [C: 031] Update: add Wikimedia logo for SEO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) (owner: 10Niedzielski) [16:58:49] (03PS3) 10RobH: weblog1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/469224 (https://phabricator.wikimedia.org/T207760) [16:59:11] (03CR) 10Dzahn: [C: 032] "noop on einsteinium: https://puppet-compiler.wmflabs.org/compiler1002/13161/" [puppet] - 10https://gerrit.wikimedia.org/r/469130 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [16:59:16] fixed [16:59:33] PROBLEM - High lag on wdqs1003 is CRITICAL: 3611 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:59:35] (03CR) 10RobH: [C: 032] weblog1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/469224 (https://phabricator.wikimedia.org/T207760) (owner: 10RobH) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1700). 
[17:00:45] (03PS3) 10Dzahn: icinga/nsca: use systemd::service, unit file by systemd-sysv-generator [puppet] - 10https://gerrit.wikimedia.org/r/469130 (https://phabricator.wikimedia.org/T202782) [17:04:03] PROBLEM - High lag on wdqs1003 is CRITICAL: 3650 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:06:02] jouncebot: next [17:06:05] In 1 hour(s) and 53 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1900) [17:06:11] jouncebot: now [17:06:12] For the next 0 hour(s) and 53 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1700) [17:06:21] James_F: shall we do it in an hour? [17:07:00] Or now. [17:07:20] Can do [17:07:28] I'll get my laptop out and watch for the final patch [17:07:49] Well, I guess the final 2, the big change in wikibase.php and the turning it back on on commons beta [17:08:51] Yeah. [17:08:55] Three. [17:09:16] Wikibase, install on Beta Commons, enable on Beta Commons. 
[17:10:26] (03CR) 10Jforrester: [C: 032] CS.php, check UseWikibaseMediaInfo for loading Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469205 (owner: 10Addshore) [17:12:17] (03Merged) 10jenkins-bot: CS.php, check UseWikibaseMediaInfo for loading Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469205 (owner: 10Addshore) [17:13:19] !log icinga1001 rm /var/log/user.log.1 - was 14G and using 25% of the / partition and server out of disk :/ [17:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:38] (03CR) 10Jforrester: [C: 032] BETA: Set wmgWikibaseCachePrefix for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469206 (owner: 10Addshore) [17:15:15] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: For WBMI, intentionally rather than implicitly install Wikibase I38574e670 (duration: 00m 47s) [17:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:46] (03CR) 10jenkins-bot: CS.php, check UseWikibaseMediaInfo for loading Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469205 (owner: 10Addshore) [17:17:26] 10Operations, 10Analytics, 10Patch-For-Review: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) a:05RobH>03Cmjohnson So, I'm not sure what port this is on, we'll need @cmjohnson to trace the cable and update the network switch (or atleast this ta... [17:17:36] 10Operations, 10ops-eqiad, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) [17:18:38] (03Merged) 10jenkins-bot: BETA: Set wmgWikibaseCachePrefix for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469206 (owner: 10Addshore) [17:19:57] (Yay for pointless syncs just to avoid leaving prod dirty.) 
[17:20:16] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: BETA: Set wmgWikibaseCachePrefix for commonswiki I0badd355723 (duration: 00m 46s) [17:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:22] (03PS3) 10Jforrester: Wikibase.php, don't load wikidata repo settings on other repos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [17:20:29] addshore: Ping for when you're ready to test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/469209 [17:23:08] (03PS1) 10Bstorm: sonofgridengine: move filetype definition into separate code [puppet] - 10https://gerrit.wikimedia.org/r/469232 (https://phabricator.wikimedia.org/T200557) [17:23:44] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: move filetype definition into separate code [puppet] - 10https://gerrit.wikimedia.org/r/469232 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:26:00] (03PS2) 10Bstorm: sonofgridengine: move filetype definition into separate code [puppet] - 10https://gerrit.wikimedia.org/r/469232 (https://phabricator.wikimedia.org/T200557) [17:26:34] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: move filetype definition into separate code [puppet] - 10https://gerrit.wikimedia.org/r/469232 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:28:27] James_F: is it on mwdebug? [17:28:38] Or not yet? [17:28:44] No, I was waiting for you to be alive. :-) [17:28:46] Not +2ed right? [17:28:52] Indeed. [17:28:56] Okay, we might have to pause, I'm in a very involved session :) [17:28:59] (03CR) 10Jforrester: [C: 032] Wikibase.php, don't load wikidata repo settings on other repos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [17:29:03] Ah, OK. [17:29:07] :p [17:29:10] (03CR) 10Jforrester: "Let's wait." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [17:29:34] addshore: I've got a 11:00; will ping afterwards to see if you're around. [17:32:32] (03CR) 10jenkins-bot: BETA: Set wmgWikibaseCachePrefix for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469206 (owner: 10Addshore) [17:33:24] (03PS3) 10Bstorm: sonofgridengine: move filetype definition into separate code [puppet] - 10https://gerrit.wikimedia.org/r/469232 (https://phabricator.wikimedia.org/T200557) [17:35:32] (03CR) 10Bstorm: [C: 032] sonofgridengine: move filetype definition into separate code [puppet] - 10https://gerrit.wikimedia.org/r/469232 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:40:14] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10Andrew) There are currently 23 projects running in the new region, and we're moving more over every day. This would have been a reasonable request when were origi... 
[17:44:20] (03PS1) 10Bstorm: sonofgridengine: old code still referenced the old gridengine module [puppet] - 10https://gerrit.wikimedia.org/r/469234 (https://phabricator.wikimedia.org/T200557) [17:46:07] (03CR) 10Bstorm: [C: 032] sonofgridengine: old code still referenced the old gridengine module [puppet] - 10https://gerrit.wikimedia.org/r/469234 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:51:57] James_F: cool [17:57:01] (03CR) 10Volans: [C: 032] tests: remove pylint skip-file [software/cumin] - 10https://gerrit.wikimedia.org/r/469222 (owner: 10Volans) [18:00:24] (03Merged) 10jenkins-bot: tests: remove pylint skip-file [software/cumin] - 10https://gerrit.wikimedia.org/r/469222 (owner: 10Volans) [18:01:49] (03PS1) 10Bstorm: sonofgridengine: remove more cruft from the old module [puppet] - 10https://gerrit.wikimedia.org/r/469240 (https://phabricator.wikimedia.org/T200557) [18:01:52] (03CR) 10jenkins-bot: tests: remove pylint skip-file [software/cumin] - 10https://gerrit.wikimedia.org/r/469222 (owner: 10Volans) [18:02:41] (03CR) 10Bstorm: [C: 032] sonofgridengine: remove more cruft from the old module [puppet] - 10https://gerrit.wikimedia.org/r/469240 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [18:11:25] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10faidon) This is essentially part of T122406, which we resolved earlier in the week with the intention of making it more specific with this task (among others). Ba... 
[18:12:24] PROBLEM - High lag on wdqs1003 is CRITICAL: 3650 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:13:39] !log icinga1001 - manually set max_concurrent_checks to 0 (unlimited), restart icinga, keep puppet disabled, for testing (it ran into the limit of 10000 all the time, causing lots of logging, and the CPU power is actually slightly lower than on einsteinium (T202782) refs: Nagios Tuning, point 7 https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html [18:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:43] T202782: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 [18:17:27] !log icinga - performance/latency comparison - https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4 vs https://icinga-stretch.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4 (T202782) [18:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:59] (03PS1) 10Bstorm: sonofgridengine: Remove yet another round of paths and such from the old gridengine [puppet] - 10https://gerrit.wikimedia.org/r/469245 (https://phabricator.wikimedia.org/T200557) [18:21:36] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: Remove yet another round of paths and such from the old gridengine [puppet] - 10https://gerrit.wikimedia.org/r/469245 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [18:22:31] (03PS2) 10Bstorm: sonofgridengine: Remove more paths and such from the old gridengine [puppet] - 10https://gerrit.wikimedia.org/r/469245 (https://phabricator.wikimedia.org/T200557) [18:23:16] (03CR) 10Bstorm: [C: 032] sonofgridengine: Remove more paths and such from the old gridengine [puppet] - 10https://gerrit.wikimedia.org/r/469245 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [18:30:21] 10Operations, 10monitoring: newer version of nrpe (check_nrpe) with fixed logging 
issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10Dzahn) [18:30:55] 10Operations, 10monitoring: newer version of nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10Dzahn) [18:30:57] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [18:31:41] 10Operations, 10hardware-requests, 10Performance-Team (Radar): eqiad: (1) misc single cpu server allocation for performance browser testing - https://phabricator.wikimedia.org/T204589 (10RobH) a:05mark>03faidon [18:33:18] (03PS1) 10Herron: logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [18:33:59] (03CR) 10jerkins-bot: [V: 04-1] logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [18:34:50] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (10RobH) Please note this has been ordered as part of T204177. [18:35:01] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (10RobH) Please note this has been ordered as part of T204177. 
[18:35:03] 10Operations, 10monitoring: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10Dzahn) [18:36:58] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) So I did some analysis with prometheus using elasticsearch nodes at eqiad and running the following queries to sh... [18:38:17] (03CR) 10GTirloni: [C: 032] shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - 10https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562) (owner: 10GTirloni) [18:38:25] (03PS7) 10GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - 10https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562) [18:38:28] (03PS2) 10Herron: logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [18:41:52] jouncebot: next [18:41:53] In 0 hour(s) and 18 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1900) [18:42:20] addshore: I'm around if you are. [18:45:24] (03PS3) 10Herron: logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [18:57:27] (03PS4) 10Herron: logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [19:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1900). 
[19:00:23] (03PS1) 10GTirloni: shinken: Fix webui.cfg template [puppet] - 10https://gerrit.wikimedia.org/r/469248 (https://phabricator.wikimedia.org/T204562) [19:01:19] (03CR) 10GTirloni: [C: 032] shinken: Fix webui.cfg template [puppet] - 10https://gerrit.wikimedia.org/r/469248 (https://phabricator.wikimedia.org/T204562) (owner: 10GTirloni) [19:01:59] James_F: in another involved bit now, perhaps in ~20/30 mins? [19:02:15] addshore: Well, train is happening now, so… maybe. :-) [19:02:23] RECOVERY - Memory correctable errors -EDAC- on wtp2013 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops [19:07:12] (03PS1) 10GTirloni: shinken: Fix typo in init.pp [puppet] - 10https://gerrit.wikimedia.org/r/469249 (https://phabricator.wikimedia.org/T204562) [19:08:07] (03CR) 10GTirloni: [C: 032] shinken: Fix typo in init.pp [puppet] - 10https://gerrit.wikimedia.org/r/469249 (https://phabricator.wikimedia.org/T204562) (owner: 10GTirloni) [19:11:01] Is the train happening right now? Because there is something security related I would like to deploy if it is not [19:11:28] James_F: cool, I'll keep you updated [19:14:50] jouncebot, now [19:14:51] For the next 1 hour(s) and 45 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1900) [19:15:10] twentyafterfour, see above [19:16:11] bawolff has the go-ahead. 
thanks Krenair [19:16:23] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 79 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:19:38] (03PS5) 10Herron: logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [19:22:04] 10Operations, 10Icinga, 10Patch-For-Review: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238 (10Dzahn) p:05Triage>03Low [19:22:14] !log deploy patch T207778 [19:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:12] (03PS1) 10Cwhite: memcached: remove memcached diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/469250 (https://phabricator.wikimedia.org/T183454) [19:24:57] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10faidon) >>! In T207536#4688654, @aborrero wrote: > My understanding of the problem is: > > * cloud supporting servi... [19:25:31] (03PS6) 10Herron: logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [19:26:07] (03CR) 10jerkins-bot: [V: 04-1] logstash: create es/kafka combined role and assign to es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:27:22] !log deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/469244/ [19:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:25] jdloft: ^ [19:27:31] sorry wrong ping [19:27:34] jdlrobson: ^ [19:27:47] awesome. 
[19:27:50] will watch the graphs [19:27:56] (03PS1) 10Dzahn: icinga: allow configuring max_concurrent_checks via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) [19:28:00] should i also cherry pick it to 1.33.0-wmf.1 ? [19:28:40] (03CR) 10jerkins-bot: [V: 04-1] icinga: allow configuring max_concurrent_checks via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:29:21] 10Operations, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10akosiaris) One problem encountered is the fact that newer docker + calico for some reason breaks IPv6 assignment. Removing IPv6 a... [19:30:59] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.26/skins/MinervaNeue/resources/skins.minerva.scripts/pageIssuesLogger.js: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/469244/ refs T207423 (duration: 00m 48s) [19:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:02] T207423: Many errors on ReadingDepth.enable (?) 
schema - https://phabricator.wikimedia.org/T207423 [19:31:19] (03PS2) 10Dzahn: icinga: allow configuring max_concurrent_checks via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) [19:32:06] (03PS7) 10Herron: logstash: apply role::kafka::logging to logstash es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [19:32:32] (03CR) 10Dzahn: "also see this old ticket: https://phabricator.wikimedia.org/T1242" [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:32:56] (03CR) 10jerkins-bot: [V: 04-1] icinga: allow configuring max_concurrent_checks via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:33:01] (03CR) 10jerkins-bot: [V: 04-1] logstash: apply role::kafka::logging to logstash es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:34:00] jdlrobson: that's deployed now, can you verify that your problem is fixed? 
[19:35:31] (03PS1) 10Joal: Add configuration for parquet-logging in hive conf [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469256 [19:35:40] (03PS3) 10Dzahn: icinga: allow configuring max_concurrent_checks via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) [19:35:52] ottomata: --^ I'll also ask Luca tomorrow :) [19:36:09] twentyafterfour keeping an eye on https://grafana.wikimedia.org/dashboard/db/reading-web-dashboard?orgId=1&panelId=16&fullscreen [19:36:36] i suspect because of caching we need to wait a few mins [19:36:45] hopefully 12:40 [19:37:18] (03PS2) 10Cwhite: hiera: remove diamond from thumbor role [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) [19:38:24] (03PS3) 10Cwhite: hiera: remove diamond from thumbor role [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) [19:38:48] (03CR) 10Dzahn: icinga: add puppet types for parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [19:40:28] (03PS2) 10Dzahn: Switch srvdumps rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467978 (owner: 10Muehlenhoff) [19:40:47] (03PS4) 10Cwhite: hiera: remove diamond from labweb role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) [19:43:03] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/13169/" [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:43:24] (03PS1) 10Cwhite: memcached: remove diamond::collector resource [puppet] - 10https://gerrit.wikimedia.org/r/469258 (https://phabricator.wikimedia.org/T183454) [19:43:53] (03PS5) 10Cwhite: hiera: remove diamond from labweb role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) [19:43:57] twentyafterfour: it worked! 
[19:43:59] pheww [19:44:05] problem fixed [19:44:07] (03PS4) 10Cwhite: hiera: remove diamond from thumbor role [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) [19:44:14] jouncebot: a thought: maybe the logging file should be called java_logging_template [19:44:15] so just need to make sure it goes out in the current change and future releases [19:44:16] (03CR) 10Dzahn: [C: 04-1] "allowed_hosts requires array (same issue we had on the change for aptrepo yesterday)" [puppet] - 10https://gerrit.wikimedia.org/r/467978 (owner: 10Muehlenhoff) [19:44:20] so that it isn't parquet specific [19:44:30] but we can put the parquet changes we need in the file [19:44:33] oops [19:44:36] joal: ^^^ [19:45:25] (03CR) 10Dzahn: [C: 04-1] Switch srvdumps rsync module to auto_ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467978 (owner: 10Muehlenhoff) [19:46:40] (03CR) 10Cwhite: [C: 031] icinga: allow configuring max_concurrent_checks via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:49:23] (03CR) 10Dzahn: [C: 04-1] "similar to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469146/ but there we just had a single host as a string and here it's a" [puppet] - 10https://gerrit.wikimedia.org/r/467978 (owner: 10Muehlenhoff) [19:52:59] (03PS1) 10Zoranzoki21: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) [19:53:43] (03PS1) 10Fomafix: Avoid redirects from HTTPS to HTTP and back to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/469262 [19:54:23] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (10faidon) I'm not sure why there would be a chicken-and-egg problem. Prod recursors run in prod, right? Why is this different? 
Also, while I can see a recursor outage cascading into variou... [19:57:20] (03CR) 10Urbanecm: [C: 04-1] "Tabs must be used to indent lines; spaces are not allowed, see https://integration.wikimedia.org/ci/job/operations-mw-config-composer-test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) (owner: 10Zoranzoki21) [19:57:28] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (10Krenair) >>! In T207533#4689756, @faidon wrote: > while I can see a recursor outage cascading into various random issues across the infrastructure, I'm unsure why such an issue would prev... [19:58:10] (03PS2) 10Zoranzoki21: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) [19:58:50] (03CR) 10Herron: "Stubbed my toe against role hiera lookups a bit with this... The cleaner approach IMHO would have been a new role that combines role::log" [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [20:03:08] (03CR) 10Volans: [C: 04-1] "LGTM, -1 just because of the invalid syntax. Feel free to merge without a follow up review, just check it compiles with the compiler ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [20:03:14] (03CR) 10Urbanecm: [C: 04-1] "Requestor didn't provided IP address(es), to prevent accidental merging I'm voting -1 here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) (owner: 10Zoranzoki21) [20:04:48] 10Operations, 10LDAP-Access-Requests: Remove "daniel" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207788 (10Addshore) [20:05:23] jouncebot: now [20:05:23] For the next 0 hour(s) and 54 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T1900) [20:08:04] thanks twentyafterfour im breaking for lunch! [20:08:22] jdlrobson: np, enjoy [20:08:45] 10Operations, 10LDAP-Access-Requests: Remove "daniel" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207788 (10Addshore) p:05Triage>03Normal [20:09:20] 10Operations, 10LDAP-Access-Requests, 10Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (10Legoktm) p:05Normal>03Triage [20:10:53] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10GTirloni) @faidon the complete separation seems like a great goal from a security perspective, but considering there... [20:12:14] 10Operations, 10LDAP-Access-Requests, 10Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (10Dzahn) @jijiki leaving it for you as part of clinic duty and to provide an example: hint: ``` [mwmaint1002:~] $... 
[20:17:06] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10Addshore) [20:17:53] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10Addshore) [20:18:04] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10Addshore) p:05Triage>03Normal [20:18:12] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10Addshore) p:05Triage>03Normal [20:18:16] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10Addshore) [20:18:54] 10Operations, 10LDAP-Access-Requests, 10Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (10Addshore) p:05Triage>03Normal [20:20:56] :'( [20:21:20] :'( is for the rm aude from wmde group part, not the daniel part [20:21:35] 10Operations: Access requests process: People sometimes specify hostnames instead of admin groups in access requests - https://phabricator.wikimedia.org/T207754 (10Dzahn) I agree. Access should always be given based on puppet roles, not host names. This translates to using hieradata/role/common/something.yaml... [20:21:53] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) >>! In T206454#4686063, @herron wrote: > * Need a proper import of confluent-kafka-2.11 1.1.0-1 for Jessie.... 
[20:23:54] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (10Andrew) Another issue is that we typically ssh via a bastion -- if the bastion is unable to resolve the target host then the connection will fail. [20:28:49] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (10Krenair) Yeah but for that one you can at least do the DNS lookup for yourself. [20:30:03] (03PS5) 10Dzahn: icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 [20:30:04] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10RobH) So, I'm a bit unclear on SSD requirements. It seems like a mixed use... [20:31:14] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10RobH) Also does this have a site preference, codfw or eqiad? [20:31:17] (03PS1) 10Bstorm: sonofgridengine: Make an effort at causing the grid master to bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/469311 (https://phabricator.wikimedia.org/T200557) [20:31:19] (03CR) 10Dzahn: icinga: add puppet types for parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [20:33:40] (03PS2) 10Bstorm: sonofgridengine: Make an effort at causing the grid master to bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/469311 (https://phabricator.wikimedia.org/T200557) [20:37:21] bawolff: :( she hasn't really been working for us for some time now [20:38:06] James_F: thoughts on trying the config? [20:40:33] addshore: Happy to do so if you have the time. [20:40:36] twentyafterfour: Train done? 
[20:41:27] Oh, no. [20:42:11] James_F: Krinkle just told me there is no train this week? :P [20:42:34] or maybe there is [20:42:35] addshore: Krinkle was wrong. This is the 1.33.0-wmf.1 train week, so things are more complicated than normal. [20:42:43] gotcha [20:42:43] James_F: not yet [20:43:23] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 (10Mathew.onipe) p:05Triage>03Normal [20:43:37] I'm about to sync 1.33.0-wmf.1 but `scap prep` errored out and I've got to verify that the new branch isn't messed up some how [20:47:01] (03PS1) 10Dzahn: icinga: don't log service/host check retries [puppet] - 10https://gerrit.wikimedia.org/r/469317 [20:47:45] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 3 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10mobrovac) >>! In T205452#4688800, @jcrespo wrote: > Sorry, there is an ops clinic duty to answer these kind of requests- I did... 
[20:55:53] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:57:56] (03PS2) 10Dzahn: icinga: don't log service/host check retries [puppet] - 10https://gerrit.wikimedia.org/r/469317 (https://phabricator.wikimedia.org/T202782) [20:58:04] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:58:27] (03PS1) 10Dzahn: icinga: logging optimizations [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) [20:59:51] (03CR) 10Cwhite: [C: 031] icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [21:00:09] (03CR) 10Cwhite: [C: 031] icinga: don't log service/host check retries [puppet] - 10https://gerrit.wikimedia.org/r/469317 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:01:32] (03PS2) 10Dzahn: icinga: logging optimizations [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) [21:03:22] (03CR) 10Dzahn: "-> waiting for https://phabricator.wikimedia.org/T207246#4689795" [puppet] - 10https://gerrit.wikimedia.org/r/467100 (https://phabricator.wikimedia.org/T207243) (owner: 10Paladox) [21:04:36] (03PS4) 10Dzahn: gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 [21:05:31] (03CR) 10Paladox: "Im not sure if the git (gerrit-test3) project uses the profile." [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [21:06:27] at least 10 times i have already clicked accidently on the links in the new gerrit footer .. when editing commit message in inline editor .. 
:p [21:06:55] (03PS5) 10Dzahn: gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 [21:07:28] (03CR) 10Dzahn: "why would it not use the profile? what else would it use?" [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [21:08:01] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13171/" [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [21:11:27] (03CR) 10Paladox: [C: 031] "Per chat" [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [21:11:38] (03CR) 10Dzahn: [C: 032] gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [21:12:04] (03CR) 10Bstorm: [C: 032] sonofgridengine: Make an effort at causing the grid master to bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/469311 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:13:10] wow, if you are a split second before submitting and then somebody else beats you, rebase is greyed out but you can't submit [21:13:18] no needs local rebase [21:13:20] now [21:15:49] press the rebase button mutante [21:16:22] paladox: it doesn't work, that's what i am reporting. it's advanced level :) [21:16:30] if you catch the right moment [21:16:39] advanced? [21:16:49] it stays greyed out. and submit is blue [21:16:54] (03PS6) 10Paladox: gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [21:16:56] reload the page [21:16:57] but you still can't submit.. it reports an error [21:16:58] mutante see ^^ [21:17:33] eh, ok, it didn't for me. and i thought i did this a lot of times before [21:17:36] thx [21:18:37] multiple commits warning now [21:18:54] bstorm_: can i merge both? [21:19:02] Mine's fine [21:19:21] Sorry, I was slow [21:19:21] done. merged "multiple" [21:19:25] Thank you! 
[21:19:25] no worries at all [21:21:00] (03CR) 10Dzahn: [C: 032] "noop on cobalt" [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [21:21:30] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10Eevans) >>! In T206017#4689984, @RobH wrote: > So, I'm a bit unclear on SSD... [21:21:56] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10Eevans) >>! In T206017#4689990, @RobH wrote: > Also does this have a site pr... [21:30:02] !log icinga1001 - changing check_result_reaper_frequency from 10 to 3, trying to lower average check latency. "allow faster check result processing -> requires more CPU" (T202782) [21:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:06] T202782: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 [21:31:12] "A value that is too high can result in large latencies for your host and service checks. A value that is too low can have the same effect." [21:42:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) Not impacting that task, but for labsdb10[08|09|10], the presence of sensitive data + need to be reached from Cl... 
[21:47:28] !log icinga1001 - replacing check_ping with check_fping as the standard host check command, for faster host checks (another tip from Nagios Tuning guide, still manual testing) (T202782) [21:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:31] T202782: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 [21:48:22] (03PS5) 10Paladox: WIP: Update gerrit to 2.16 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/463509 [21:48:43] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [21:50:22] (03PS1) 10Bstorm: sonofgridengine: need to create the spooldb folder [puppet] - 10https://gerrit.wikimedia.org/r/469328 (https://phabricator.wikimedia.org/T200557) [21:51:04] (03PS1) 10MSantos: Decrease OSM update Frequency [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) [21:52:14] (03PS2) 10MSantos: Decrease OSM update Frequency [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) [21:56:18] (03CR) 10Bstorm: [C: 032] sonofgridengine: need to create the spooldb folder [puppet] - 10https://gerrit.wikimedia.org/r/469328 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:58:12] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) There is a [[ https://github.com/kartotherian/osm-bright.tm2source/pull/66 | PR ]] that changes the `popul... 
[22:01:09] (03PS1) 10Dmaza: Enable partial blocks on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469332 [22:02:23] (03CR) 10Dbarratt: [C: 031] Enable partial blocks on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469332 (owner: 10Dmaza) [22:08:07] (03PS1) 10Dzahn: icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) [22:09:59] (03PS1) 10Catrope: Enable $wgWMEUnderstandingFirstDay on Korean beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469334 [22:11:10] (03CR) 10Catrope: [C: 032] Enable $wgWMEUnderstandingFirstDay on Korean beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469334 (owner: 10Catrope) [22:11:37] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10RobH) >>! In T206017#4689984, @RobH wrote: > So, I'm a bit unclear on SSD re... 
[22:11:45] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10RobH) a:05Eevans>03RobH [22:13:36] (03Merged) 10jenkins-bot: Enable $wgWMEUnderstandingFirstDay on Korean beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469334 (owner: 10Catrope) [22:14:48] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [22:14:56] !log scap prep 1.33.0-wmf.1 [22:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:46] 10Operations, 10hardware-requests: Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752 (10RobH) [22:22:48] 10Operations, 10ops-eqiad, 10hardware-requests: eqiad: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139775 (10RobH) 05Open>03declined [22:24:16] (03PS1) 10Dzahn: icinga: tell rsyslog to discard logs from check_nrpe [puppet] - 10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) [22:25:43] (03CR) 10jerkins-bot: [V: 04-1] icinga: tell rsyslog to discard logs from check_nrpe [puppet] - 10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: 10Dzahn) [22:26:09] 10Operations, 10Revision-Slider, 10TCB-Team, 10WMDE-Analytics-Engineering, 10Graphite: Fix aggregation of "MediaWiki.RevisionSlider.event.load.sum" from average to sum - https://phabricator.wikimedia.org/T205416 (10GoranSMilovanovic) From the description of the task I am not sure if just changing the agg... 
[22:28:40] (03CR) 10jenkins-bot: Enable $wgWMEUnderstandingFirstDay on Korean beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469334 (owner: 10Catrope) [22:56:49] (03PS1) 10Alexandros Kosiaris: calico: Support version 2.4.1 [puppet] - 10https://gerrit.wikimedia.org/r/469339 (https://phabricator.wikimedia.org/T207804) [22:58:30] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) > I've created a new VM, t206636-2.wikidata-query... [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181023T2300). [23:00:05] bpirkle: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:31] The moon idea sounds fun too [23:02:48] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) @Andrew Also looks like there is some puppet issu... [23:03:33] (03CR) 10Cwhite: "This looks right to me. Not sure why jenkins is complaining about not finding rsyslog::conf." [puppet] - 10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: 10Dzahn) [23:05:19] bpirkle: the train is still underway, however, I can deploy your patch to wmf.26 [23:05:42] 👍 [23:06:33] bpirkle: is the batch size 50000 going to have an impact on databases? should we clear it with dbas? 
[23:08:03] I'm happy to clear it if necessary, but 50000 should still be low for this operation - the 1000 turned out to be ridiculously small. [23:09:11] apergos (who is in bed right now, but whose change this cherry pick is from) said this: "it's a dump-related thing, the only test is once it hits the snapshots, and I've got an install on one of them with .26 and that patch. so I guess you can let them know I'll double check it on the snaps in the morning" [23:09:12] bpirkle: If you are comfortable with it then I am. [23:09:14] If that helps [23:09:33] I'm comfortable with it [23:09:35] cool [23:10:01] just waiting on the gate-and-submit job [23:19:14] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Smalyshev) a:05Smalyshev>03None [23:29:42] bpirkle: syncing [23:30:52] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.26/includes/export/WikiExporter.php: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/469319/ refs T207628 (duration: 01m 39s) [23:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:56] T207628: stubs dumps run much slower after move from bufferied queries to batches - https://phabricator.wikimedia.org/T207628 [23:31:24] !log twentyafterfour@deploy1001 Started scap: syncing 1.33.0-wmf.1 refs T206655 [23:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:27] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655 [23:46:24] Started scap [23:46:29] oops [23:46:44] Ignore me typing in the wrong places :) [23:46:50] :D