[00:00:41] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 54.96 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:03:37] (CR) Sbisson: "> Blocked on wmf.23 being deployed on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: Sbisson)
[00:04:01] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 86.82 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:31:50] Operations, MediaWiki-Shell, Core Platform Team Goals (MCR: Uncategorized), Core Platform Team Kanban (Later): Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (CCicalese_WMF)
[00:34:47] (PS1) Jdlrobson: smaller wiki a/b tests are bumped to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/463875 (https://phabricator.wikimedia.org/T200792)
[00:50:27] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (chelsyx) Hi @herron , I was just checking @mpopov (username: bearloga) 's groups and these three are the o...
[00:56:11] (PS1) BryanDavis: jdk8: make java-1.8.0-openjdk-amd64 default [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/463877 (https://phabricator.wikimedia.org/T205774)
[00:59:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 31 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:00:21] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 30 probes of 339 (alerts on 25) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:02:50] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 55.89 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:05:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 12 probes of 339 (alerts on 25) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:10:31] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 77.95 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:12:50] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 26 probes of 339 (alerts on 25) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:14:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 18 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:17:51] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 1 probes of 339 (alerts on 25) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:23:32] (CR) Catrope: [C: -2] "Yes I do, oops" [mediawiki-config] - https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: Sbisson)
[01:24:40] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received
[01:25:40] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[02:34:54] Operations, MediaWiki-Shell, Core Platform Team, Core Platform Team Kanban (Later): Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (CCicalese_WMF)
[02:49:50] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:52:00] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:26:30] PROBLEM - Disk space on actinium is CRITICAL: DISK CRITICAL - free space: / 338 MB (3% inode=88%)
[03:49:40] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 57.69 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:52:00] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:52:34] Operations, Maps, Reading-Infrastructure-Team-Backlog, Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (Mholloway)
[03:52:51] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 57.17 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:56:20] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:59:53] Operations, Cloud-Services, Parsing-Team, Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (Krinkle) >>! In T163438#4577979...
[03:59:59] Operations, Cloud-Services, Parsing-Team, Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (Krinkle) Resolved>Open
[04:01:05] (CR) Zhuyifei1999: "I'm neither familiar with java nor update-alternatives, but I can help with building the image :). 'jdk8' image using jdk8 as the default " [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/463877 (https://phabricator.wikimedia.org/T205774) (owner: BryanDavis)
[04:05:50] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 76.48 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:08:41] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[04:09:00] PROBLEM - url_downloader on alcyone is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused
[04:09:20] PROBLEM - url_downloader on actinium is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused
[04:09:21] PROBLEM - url_downloader on alsafi is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused
[04:09:21] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[04:09:30] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[04:09:31] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[04:09:31] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[04:09:41] PROBLEM - url_downloader on aluminium is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused
[05:01:41] <_joe_> wtf is happening with citoid and url-downloader?
[05:07:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:08:50] RECOVERY - url_downloader on actinium is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080
[05:08:51] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[05:08:51] RECOVERY - url_downloader on alsafi is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080
[05:09:20] RECOVERY - url_downloader on aluminium is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080
[05:09:30] RECOVERY - Disk space on actinium is OK: DISK OK
[05:09:30] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[05:09:40] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[05:09:40] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[05:09:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:09:41] RECOVERY - url_downloader on alcyone is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080
[05:11:21] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 56 probes of 341 (alerts on 25) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:16:30] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 341 (alerts on 25) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:22:42] <_joe_> !log stopped tilerator on maps1004, was spamming like crazy
[05:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:11] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[05:24:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:24:30] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[05:24:31] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[05:24:31] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:24:40] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[05:24:51] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[05:24:51] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:25:00] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[05:25:01] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Scrapes sample page) is CRITICAL: Test Scrapes sample page returned the unexpected status 404 (expecting: 200)
[05:25:01] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:25:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:25:10] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:25:11] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:25:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:25:11] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:25:20] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 404 (expecting: 200)
[05:35:10] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[05:35:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[05:35:11] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[05:35:11] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:35:11] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy
[05:35:20] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[05:35:20] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[05:35:31] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[05:35:31] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[05:35:31] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[05:35:41] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[05:35:41] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[05:36:00] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[05:36:01] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy
[05:36:01] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[05:36:01] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[05:36:10] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
[05:36:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:39:30] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 49.12 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:46:01] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.77 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:08:30] what on earth happened ?
[06:28:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:31:00] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:31:03] akosiaris: I don't know but it looks like it's ongoing
[06:31:07] Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.23/includes/parser/Preprocessor_Hash.php on line 184
[06:31:22] quite a lot of those fatals
[06:32:30] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:32:30] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:34:10] twentyafterfour: the root cause was resolved for the thing I was terrified about. I am not sure if those fatals are related or not
[06:34:26] Operations, Puppet, User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (jcrespo) > This needs the debian package to rebuild. Please setup a specific system user (with post-inst or puppet) to run the script and give a passwordless login...
[06:40:04] Operations, Citoid, Services, Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (Mvolz) This version is now tested the compatible patch for citoid is here: https://gerrit.wikimedia.org/r/#/c/mediawiki/services/citoid/+/463713/
[06:56:20] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:51] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:10:30] (PS1) Marostegui: db-codfw.php: Depool db2072 [mediawiki-config] - https://gerrit.wikimedia.org/r/463897 (https://phabricator.wikimedia.org/T205913)
[07:12:55] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2072 [mediawiki-config] - https://gerrit.wikimedia.org/r/463897 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[07:13:43] (PS1) Banyek: wikireplicas: config file changed for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/463900 (https://phabricator.wikimedia.org/T203674)
[07:16:19] (CR) Marostegui: [C: 1] wikireplicas: config file changed for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/463900 (https://phabricator.wikimedia.org/T203674) (owner: Banyek)
[07:16:21] (Merged) jenkins-bot: db-codfw.php: Depool db2072 [mediawiki-config] - https://gerrit.wikimedia.org/r/463897 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[07:17:42] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2072 (duration: 01m 02s)
[07:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:51] !log Deploy schema change on db2072 T205913
[07:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:56] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[07:18:57] (PS3) Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907)
[07:18:59] (PS2) Alexandros Kosiaris: scap: Replace an ugly hack with puppet 4 syntax [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907)
[07:19:01] (PS2) Alexandros Kosiaris: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907)
[07:20:45] Operations, SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (Gehel) This was approved during weekly SRE meeting. Let's @RobH confirm how we want to implement in this case.
[07:23:09] (CR) jenkins-bot: db-codfw.php: Depool db2072 [mediawiki-config] - https://gerrit.wikimedia.org/r/463897 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[07:24:53] (CR) Banyek: [C: 2] wikireplicas: config file changed for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/463900 (https://phabricator.wikimedia.org/T203674) (owner: Banyek)
[07:26:13] Operations, Puppet, Patch-For-Review, User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (Banyek) >>! In T203674#4632896, @jcrespo wrote: >> This needs the debian package to rebuild. > > Please setup a specific system user (with pos...
[07:27:37] (PS1) Marostegui: Revert "db-codfw.php: Depool db2072" [mediawiki-config] - https://gerrit.wikimedia.org/r/463901
[07:27:45] Operations, Puppet, Patch-For-Review, User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (jcrespo) Just to be clear, passwordless here means "using socket authentication", not literally passwordless :-)
[07:29:51] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2072" [mediawiki-config] - https://gerrit.wikimedia.org/r/463901 (owner: Marostegui)
[07:30:55] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2072" [mediawiki-config] - https://gerrit.wikimedia.org/r/463901 (owner: Marostegui)
[07:32:00] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2072 (duration: 00m 55s)
[07:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:36] (PS1) Marostegui: db-codfw.php: Depool db2088:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/463902
[07:32:48] (CR) Giuseppe Lavagetto: [C: 1] scap: Replace an ugly hack with puppet 4 syntax [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[07:33:50] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2088:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/463902 (owner: Marostegui)
[07:34:17] (CR) Giuseppe Lavagetto: [C: 1] Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[07:34:42] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:34:52] (Merged) jenkins-bot: db-codfw.php: Depool db2088:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/463902 (owner: Marostegui)
[07:36:11] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2088:3311 (duration: 00m 55s)
[07:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:53] !log Deploy schema change on db2088:3311 T205913
[07:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:57] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[07:38:19] (PS1) Marostegui: Revert "db-codfw.php: Depool db2088:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/463904
[07:38:21] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2072" [mediawiki-config] - https://gerrit.wikimedia.org/r/463901 (owner: Marostegui)
[07:38:23] (CR) jenkins-bot: db-codfw.php: Depool db2088:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/463902 (owner: Marostegui)
[07:39:27] (PS4) Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907)
[07:39:29] (PS3) Alexandros Kosiaris: scap: Replace an ugly hack with puppet 4 syntax [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907)
[07:39:31] (PS3) Alexandros Kosiaris: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907)
[07:39:48] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2088:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/463904 (owner: Marostegui)
[07:40:16] (PS4) Alexandros Kosiaris: scap: Replace an ugly hack with puppet 4 syntax [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907)
[07:41:13] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2088:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/463904 (owner: Marostegui)
[07:42:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2088:3311 (duration: 00m 56s)
[07:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:14] (PS1) Elukey: Add prometheus mysql exporter for analytics-meta and matomo [puppet] - https://gerrit.wikimedia.org/r/463906 (https://phabricator.wikimedia.org/T202962)
[07:45:27] marostegui: o/ - is it ok if I use the mysqld exporters in --^ and then add them to the prometheus analytics prometheus instance?
[07:45:44] (PS1) Marostegui: db-codfw.php: Depool db2071 [mediawiki-config] - https://gerrit.wikimedia.org/r/463908
[07:45:47] in this way those will not pop up in your metrics
[07:45:59] but I'll have some for those databases :)
[07:46:28] elukey: I haven't touched much the exporter, but that looks reasonable yeah
[07:46:53] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2071 [mediawiki-config] - https://gerrit.wikimedia.org/r/463908 (owner: Marostegui)
[07:47:49] Operations, DBA, cloud-services-team, wikitech.wikimedia.org, Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (jcrespo) As of 2018-10-02: ``` ls -rw-r--r-- 1 dump dump 3.2G Sep 25 21:37 mgwiktionary.gz.tar -rw-r--r-- 1 dump...
[07:47:55] (Merged) jenkins-bot: db-codfw.php: Depool db2071 [mediawiki-config] - https://gerrit.wikimedia.org/r/463908 (owner: Marostegui)
[07:48:28] !log mholloway-shell@deploy1001 Started deploy [kartotherian/deploy@0bf513a] (maps1004): Remove HTTP proxy
[07:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:45] !log mholloway-shell@deploy1001 Finished deploy [kartotherian/deploy@0bf513a] (maps1004): Remove HTTP proxy (duration: 00m 16s)
[07:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:02] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2071 (duration: 00m 56s)
[07:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:55] !log mholloway-shell@deploy1001 Started deploy [tilerator/deploy@6c80537] (maps1004): Disable event logging requests and remove HTTP proxy
[07:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:13] !log mholloway-shell@deploy1001 Finished deploy [tilerator/deploy@6c80537] (maps1004): Disable event logging requests and remove HTTP proxy (duration: 00m 17s)
[07:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:21] !log Deploy schema change on db2071 T205913
[07:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:25] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[07:50:48] (PS1) Marostegui: Revert "db-codfw.php: Depool db2071" [mediawiki-config] - https://gerrit.wikimedia.org/r/463909
[07:52:01] (PS2) Elukey: Add prometheus mysql exporter for analytics-meta and matomo [puppet] - https://gerrit.wikimedia.org/r/463906 (https://phabricator.wikimedia.org/T202962)
[07:52:12] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2071" [mediawiki-config] - https://gerrit.wikimedia.org/r/463909 (owner: Marostegui)
[07:53:15] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2071" [mediawiki-config] - https://gerrit.wikimedia.org/r/463909 (owner: Marostegui)
[07:53:45] (CR) Alexandros Kosiaris: [C: 2] scap: Replace an ugly hack with puppet 4 syntax [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[07:54:06] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2088:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/463904 (owner: Marostegui)
[07:54:08] (CR) jenkins-bot: db-codfw.php: Depool db2071 [mediawiki-config] - https://gerrit.wikimedia.org/r/463908 (owner: Marostegui)
[07:54:10] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2071" [mediawiki-config] - https://gerrit.wikimedia.org/r/463909 (owner: Marostegui)
[07:54:23] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2071 (duration: 00m 55s)
[07:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:27] (CR) Alexandros Kosiaris: [C: 2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/12710/deploy1001.eqiad.wmnet/ is actually a noop. The conftool variable change is bec" [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907) (owner: Alexandros Kosiaris)
[07:55:30] (PS1) Jcrespo: mariadb: Depool db1110 for testing s3 imports [mediawiki-config] - https://gerrit.wikimedia.org/r/463910 (https://phabricator.wikimedia.org/T184805)
[07:56:45] (CR) Jcrespo: [C: 2] mariadb: Depool db1110 for testing s3 imports [mediawiki-config] - https://gerrit.wikimedia.org/r/463910 (https://phabricator.wikimedia.org/T184805) (owner: Jcrespo)
[07:57:42] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:57:48] (Merged) jenkins-bot: mariadb: Depool db1110 for testing s3 imports [mediawiki-config] - https://gerrit.wikimedia.org/r/463910 (https://phabricator.wikimedia.org/T184805) (owner: Jcrespo)
[08:01:23] !log converting wikidatawiki.content to TokuDB on host dbstore1002 (T205544)
[08:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:27] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544
[08:01:28] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12712/" [puppet] - https://gerrit.wikimedia.org/r/463906 (https://phabricator.wikimedia.org/T202962) (owner: Elukey)
[08:01:36] (PS3) Elukey: Add prometheus mysql exporter for analytics-meta and matomo [puppet] - https://gerrit.wikimedia.org/r/463906 (https://phabricator.wikimedia.org/T202962)
[08:04:12] !log Deploy schema change on s5 eqiad master, lag will be generated T205913
[08:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:16] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[08:04:41] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1110 (duration: 00m 56s)
[08:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:01] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:08:17] this is me --^
[08:09:19] (CR) jenkins-bot: mariadb: Depool db1110 for testing s3 imports [mediawiki-config] - https://gerrit.wikimedia.org/r/463910 (https://phabricator.wikimedia.org/T184805) (owner: Jcrespo)
[08:10:33] Operations, DBA, cloud-services-team, wikitech.wikimedia.org, Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (Petar.petkovic)
[08:12:40] PROBLEM - Check systemd state on matomo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:12:58] (CR) Zfilipin: smaller wiki a/b tests are bumped to 100% (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/463875 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson)
[08:16:02] !log test recover some s3 wiki data onto db1110 (s5)
[08:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:30] Amir1: (re: T181630) sure, let's chat tomorrow?
[08:20:31] T181630: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630
[08:20:46] Operations, DBA, cloud-services-team, wikitech.wikimedia.org, Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (jcrespo) Running to check timing and correctness: ``` # time recover_section.py s3 --database enwikivoyage --host d...
[08:21:01] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore) This patch should go live on wikidata.org with the...
[08:21:42] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [08:25:19] 10Operations, 10Continuous-Integration-Infrastructure, 10Goal, 10Release-Engineering-Team (Kanban), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10fgiunchedi) >>! In T182759#4631716, @dduvall wrote: > As of this morning both Jenkins master have the... [08:26:30] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational [08:27:41] RECOVERY - Check systemd state on matomo1001 is OK: OK - running: The system is fully operational [08:27:43] 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10fgiunchedi) At yesterday's monitoring/logging meeting we've discussed this and concluded that for good hygiene and decoupling it makes sense to spin up a new Kafka cluste... [08:28:08] !log Deploy schema change on s6 eqiad master, lag will be generated T205913 [08:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:13] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [08:36:40] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:40:31] 10Operations, 10Traffic, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) This needs some work: - We need to document that this should be disabled on initial data import /... 
[08:42:22] 10Operations, 10Traffic, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) Blocking this until the service/config is up to date on all hosts. [08:42:26] (03PS1) 10Elukey: role::prometheus::analytics: add mysqld exporter references [puppet] - 10https://gerrit.wikimedia.org/r/463915 (https://phabricator.wikimedia.org/T202962) [08:42:46] 10Operations, 10Cloud-Services, 10Parsing-Team, 10Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (10akosiaris) Yup, https://gerrit.... [08:43:11] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:43:15] 10Operations, 10Traffic, 10Maps (Tilerator), 10Reading-Infrastructure-Team-Backlog (Kanban): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) [08:43:19] !log disabling puppet on es2001 and disabling backups too [08:43:20] (03CR) 10Elukey: [C: 032] role::prometheus::analytics: add mysqld exporter references [puppet] - 10https://gerrit.wikimedia.org/r/463915 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [08:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:57] godog: sure thing, just one thing, it seems logstash is broken on beta. 
Elastic fails to load up there it seems [08:45:20] PROBLEM - HHVM rendering on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:45:33] (03CR) 10Alexandros Kosiaris: [C: 031] Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [08:46:11] RECOVERY - HHVM rendering on mw2261 is OK: HTTP OK: HTTP/1.1 200 OK - 75162 bytes in 0.245 second response time [08:46:28] Amir1: ugh, yeah I saw the task, thanks! I can't actually login on kibana there but afaik the ldap credentials should work? [08:47:48] godog: it uses a different ldap credentials [08:47:53] let me get it for you [08:48:21] godog: user name "preview" [08:48:30] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196 (10fgiunchedi) [08:49:39] wikitech credentials won't work there. [08:49:52] *nod* thanks [09:01:20] Amir1: I kicked logstash and looks like it is back now, but I can't tell right away what was wrong :| [09:02:13] godog: Thanks! keep me posted [09:03:32] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 58.68 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:03:48] (03PS6) 10Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 [09:04:17] (03CR) 10Alexandros Kosiaris: [C: 031] sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:05:00] 10Operations, 10Community-Tech, 10MediaWiki-Parser, 10Thumbor, and 2 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10fgiunchedi) Adding #thumbor too since I'm sure it'll be affected as well. 
re: swift space concerns I don't think it'll be a problem unless the... [09:05:41] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 73.36 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:05:55] Amir1: will do, my hunch is either one logstash input freaked out or death by GC [09:11:06] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash packet loss - https://phabricator.wikimedia.org/T200960 (10fgiunchedi) 05Open>03Resolved >>! In T200960#4525248, @fgiunchedi wrote: > A couple of days ago a sudden spike of syslog udp input caused again packet loss. IOW we have mitigated th... [09:11:26] (03PS1) 10Elukey: role::prometheus::analytics: add mysql jobs to the config [puppet] - 10https://gerrit.wikimedia.org/r/463918 (https://phabricator.wikimedia.org/T202962) [09:11:59] (03CR) 10Elukey: [C: 032] role::prometheus::analytics: add mysql jobs to the config [puppet] - 10https://gerrit.wikimedia.org/r/463918 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [09:18:12] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [09:19:52] (03PS1) 10Elukey: role::prometheus::analytics: fix mysql exporter config [puppet] - 10https://gerrit.wikimedia.org/r/463919 (https://phabricator.wikimedia.org/T202962) [09:20:01] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) Also to take into consideration that services moving to k8s have `statsd_exporter` listening on `localhost`, for those there's no deployment needed, only writing... 
[09:20:26] (03CR) 10Elukey: [C: 032] role::prometheus::analytics: fix mysql exporter config [puppet] - 10https://gerrit.wikimedia.org/r/463919 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [09:25:53] (03PS7) 10Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 [09:29:45] (03CR) 10Vgutierrez: [C: 032] api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk) [09:30:01] !log Deploy schema change on s2 eqiad master, lag will be generated T205913 [09:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:05] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [09:31:26] (03Merged) 10jenkins-bot: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk) [09:32:44] (03CR) 10Vgutierrez: [C: 032] Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 (owner: 10Alex Monk) [09:32:48] (03CR) 10Vgutierrez: [C: 032] Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 (owner: 10Alex Monk) [09:32:51] (03CR) 10Vgutierrez: [C: 032] setup.py test dependencies: Remove pylint maximum version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 (owner: 10Alex Monk) [09:32:54] (03CR) 10Vgutierrez: [C: 032] Compatibility with new flask version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459841 (owner: 10Alex Monk) [09:33:12] (03CR) 10jenkins-bot: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk) [09:35:18] (03CR) 10jerkins-bot: [V: 04-1] Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 (owner: 
10Alex Monk) [09:35:20] (03CR) 10jerkins-bot: [V: 04-1] setup.py test dependencies: Remove pylint maximum version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 (owner: 10Alex Monk) [09:35:22] (03CR) 10jerkins-bot: [V: 04-1] Compatibility with new flask version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459841 (owner: 10Alex Monk) [09:35:49] 10Operations, 10Maps, 10Traffic, 10Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10Gehel) While we validated with @BBlack that the expected invalidation load on varnish is reasonable, we have not checked that this lo... [09:36:23] (03PS5) 10Alex Monk: Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 [09:36:34] (03CR) 10Alex Monk: [C: 032] "rebase approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 (owner: 10Alex Monk) [09:38:13] (03Merged) 10jenkins-bot: Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 (owner: 10Alex Monk) [09:38:53] (03PS4) 10Alex Monk: Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 [09:39:00] (03CR) 10Alex Monk: [C: 032] "rebase approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 (owner: 10Alex Monk) [09:39:56] (03CR) 10jenkins-bot: Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 (owner: 10Alex Monk) [09:40:37] (03Merged) 10jenkins-bot: Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 (owner: 10Alex Monk) [09:41:29] (03PS3) 10Alex Monk: setup.py test dependencies: Remove pylint maximum version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 [09:41:32] (03CR) 10Alex Monk: [V: 032 C: 032] setup.py 
test dependencies: Remove pylint maximum version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 (owner: 10Alex Monk) [09:41:40] (03CR) 10Alex Monk: [C: 032] "wrong button" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 (owner: 10Alex Monk) [09:41:52] (03CR) 10Alex Monk: [C: 032] "meant to say rebase approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 (owner: 10Alex Monk) [09:41:58] :) [09:42:14] (03CR) 10jenkins-bot: Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 (owner: 10Alex Monk) [09:43:20] (03Merged) 10jenkins-bot: setup.py test dependencies: Remove pylint maximum version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 (owner: 10Alex Monk) [09:43:40] (03PS5) 10Alex Monk: Compatibility with new flask version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459841 [09:43:46] (03CR) 10Alex Monk: [C: 032] "rebase approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459841 (owner: 10Alex Monk) [09:45:02] (03CR) 10jenkins-bot: setup.py test dependencies: Remove pylint maximum version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 (owner: 10Alex Monk) [09:46:08] (03Merged) 10jenkins-bot: Compatibility with new flask version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459841 (owner: 10Alex Monk) [09:47:46] (03CR) 10jenkins-bot: Compatibility with new flask version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459841 (owner: 10Alex Monk) [09:47:53] (03PS5) 10Vgutierrez: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [09:51:40] (03PS1) 10Elukey: profile::piwik::database: deploy the mysql exporter on jessie [puppet] - 10https://gerrit.wikimedia.org/r/463920 (https://phabricator.wikimedia.org/T202962) [09:54:38] (03CR) 10Elukey: [C: 032] 
"https://puppet-compiler.wmflabs.org/compiler1002/12713/" [puppet] - 10https://gerrit.wikimedia.org/r/463920 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [09:57:33] (03PS1) 10Elukey: Revert "profile::piwik::database: deploy the mysql exporter on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/463921 [09:58:31] PROBLEM - Check systemd state on bohrium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:59:20] this is me --^ [09:59:23] ^elukey: what issue did you have? [10:00:10] jynus: on bohrium I don't have the unix plugin deployed, I thought it was but I was wrong [10:00:30] (bohrium is going to be replaced soon by matomo1001, I just wanted to check metrics before switching) [10:00:53] you can enable the plugin in a hot way [10:01:14] ah I thought it was only doable via a restart [10:01:31] https://mariadb.com/kb/en/library/authentication-plugin-unix-socket/ [10:01:47] assuming it is somewhat on path [10:01:55] checking thanks :) [10:02:13] and then you have to add/update the right grants [10:04:05] jynus: the rest works! https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fanalytics&var-server=analytics1003&var-port=13306 [10:06:57] apergos: a rsync from dumpsdata1001 is causing load issues in labstore1007, did this happen before? [10:07:15] no [10:08:03] apergos: shall I just throttle the rsync? [10:08:58] (03CR) 10Elukey: [C: 032] Revert "profile::piwik::database: deploy the mysql exporter on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/463921 (owner: 10Elukey) [10:09:44] let me look into it for a minute [10:09:50] ok [10:11:16] I see that one of these rsyncs goes wtihout a throttle, the rest do [10:11:22] I'll fix it right now [10:11:51] great! 
[10:11:58] (03PS3) 10Zoranzoki21: Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) [10:12:11] PROBLEM - HTTP releases-jenkins.wikimedia.org on releases2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 5880 bytes in 2.063 second response time [10:12:16] (03PS4) 10Zoranzoki21: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) [10:13:24] oh, lie [10:13:35] it's only a (very small) tarball that goes over without a limit [10:13:46] I'll add it in any case but all other rsyncs have a bw limit already [10:13:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463924 [10:14:07] arturo: [10:14:19] ac [10:14:21] ack* [10:15:21] are you still seeing the issue now? [10:16:18] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463924 (owner: 10Marostegui) [10:17:08] apergos: now rsyncd is not the most cpu consuming proc in labstore1007, but load avg numbers are still high, we will have to check in like 15 mins [10:18:19] (03PS2) 10Marostegui: db-codfw.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463924 [10:19:07] cpu load probably is not a matter of throttling the network bandwidth but rather more generation of the file list [10:19:30] and with the tarball there's only one file, so there's not even that [10:19:59] (03PS1) 10ArielGlenn: rsync dumps status files to peers with bwlimit [puppet] - 10https://gerrit.wikimedia.org/r/463925 [10:20:23] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463924 (owner: 10Marostegui) [10:20:31] RECOVERY - Check systemd state on bohrium is OK: OK - running: The system is fully operational 
[10:21:13] apergos: https://graphite.wikimedia.org/S/P [10:22:59] what's using the most cpu right now? [10:23:08] `htop` lol [10:23:14] and diamond [10:23:21] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463924 (owner: 10Marostegui) [10:23:23] and prometheus [10:23:34] but load is slowly going down [10:23:39] now in ~16 [10:23:40] (03CR) 10ArielGlenn: [C: 032] rsync dumps status files to peers with bwlimit [puppet] - 10https://gerrit.wikimedia.org/r/463925 (owner: 10ArielGlenn) [10:24:03] `load average: 17.18, 17.81, 19.21` [10:24:52] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2092 (duration: 00m 56s) [10:24:53] !log Deploy schema change on db2092 - T203709 [10:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:57] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [10:25:15] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463926 [10:25:41] RECOVERY - High load average on labstore1007 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [10:25:50] apergos: ^^^ [10:28:47] well the tarball took less than two minutes to go over (looking at the entries in the syslog) [10:28:50] worst case. 
[10:29:03] so it's certainly not responsible for all the load spike on the graph [10:30:09] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463926 (owner: 10Marostegui) [10:31:31] there are the stat1005 rsync pulls going on too [10:31:56] yup [10:33:02] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463926 (owner: 10Marostegui) [10:34:08] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2092 (duration: 00m 56s) [10:34:08] (03PS1) 10Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 [10:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:19] (03CR) 10Ema: [C: 04-1] "Fails to build with:" [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (owner: 10Alex Monk) [10:34:38] it might be nice to ensure in some way that there's no more than one pull going on at a time [10:35:41] (03PS10) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [10:36:15] how does it look on labstore1006? [10:38:14] apergos: https://graphite.wikimedia.org/S/R not very busy [10:38:28] it gets those same rsyncs [10:38:43] the exact same? [10:38:52] both the pushes from dumpsdata1001 and the stat1005 pulls [10:38:53] yup [10:38:54] (03CR) 10jenkins-bot: db-codfw.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463924 (owner: 10Marostegui) [10:38:56] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463926 (owner: 10Marostegui) [10:39:44] weird... 
[10:39:58] so there's some additional factor on labstore1007 making things worse [10:40:40] ok, but anyway when I first checked, rsyncd was the winner in both htop and iotop tables [10:41:34] (03PS1) 10Banyek: wikireplicas: wmf-pt-kill system user should log in to mysql [puppet] - 10https://gerrit.wikimedia.org/r/463933 [10:42:55] do we know the time period where it was the top consumer? [10:43:03] (03CR) 10Jcrespo: [C: 031] "Looks good to me, did you test it on your vm and it worked ok?" [puppet] - 10https://gerrit.wikimedia.org/r/463933 (owner: 10Banyek) [10:43:20] not really :-/ [10:43:33] (03CR) 10Banyek: "> Looks good to me, did you test it on your vm and it worked ok?" [puppet] - 10https://gerrit.wikimedia.org/r/463933 (owner: 10Banyek) [10:43:56] well wthin 30 minutes that one change will be live on dumpsdata1001 (which pushes out to the labstores) [10:44:23] it won't kick in until this run of the script is complete, whenever that is [10:44:42] apergos: which patch? [10:44:56] bw capping the tarball rsync [10:45:02] link? :-) [10:45:08] the other rsyncs (the ones that take any length of time) are already capped [10:45:24] (01:23:40 μμ) wikibugs: (CR) ArielGlenn: [C: 2] rsync dumps status files to peers with bwlimit [puppet] - https://gerrit.wikimedia.org/r/463925 (owner: ArielGlenn) [10:45:39] (I ignore wikibugs on IRC, too noisy for me) [10:47:01] actually since there's no rsync to labstore1007 happening now that I would interrupt, I'm going to run puppet on dumpsdata1001 now [10:47:10] thanks apergos ! [10:47:22] honestly I don't expect this to be the issue though [10:49:01] well, nfsd is now the winner in iotop by far [10:49:37] marostegui: still deploying db changes? can i steal deploy1001 off of you for 5 mins? [10:49:56] mobrovac: all yours, I am about to leave for lunch! 
[10:50:01] kk thnx [10:51:08] (03PS3) 10Mobrovac: RunSingleJob: Delay job execution while in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) [10:53:58] (03CR) 10Mobrovac: [C: 032] RunSingleJob: Delay job execution while in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [10:56:01] (03Merged) 10jenkins-bot: RunSingleJob: Delay job execution while in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [10:56:15] (03PS1) 10Mathew.onipe: check_elasticsearch_shard_size: alert display format update [puppet] - 10https://gerrit.wikimedia.org/r/463934 [10:58:16] !log mobrovac@deploy1001 Synchronized rpc/RunSingleJob.php: RunSingleJob: Delay job execution while in read-only mode - T204154 (duration: 00m 57s) [10:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:21] T204154: Kafka JobQueue should respect DB readonly mode - https://phabricator.wikimedia.org/T204154 [10:59:01] (03PS1) 10Jcrespo: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) [10:59:11] (03CR) 10Ema: Debian packaging (033 comments) [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (owner: 10Alex Monk) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T1100). [11:00:04] Jdrewniak (WMF) and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] o/ [11:00:15] \o [11:00:36] jan_drewniak, Amir1: both of you are deployers, right? [11:00:43] I am [11:00:43] so, you know what to do ;) [11:00:51] I'm around in case you need me [11:01:00] I go first if that's fine [11:01:07] This fatals: https://en.wikipedia.org/wiki/Special:ProblemChanges [11:01:12] Amir1: looks like jan_drewniak is not around, so go ahead [11:01:26] Amir1: you're fixing that? or just noticed it? [11:01:32] fixing it [11:01:41] then do please go ahead :D [11:01:53] :D [11:02:09] (03CR) 10Jcrespo: "The plan is to move them for now on eqiad only, and reuse the read only time for the datacenter switch back to move the configuration so n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [11:02:32] (03CR) 10jenkins-bot: RunSingleJob: Delay job execution while in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [11:02:47] (03CR) 10Jcrespo: "Note: We may have to sync the dblists on eqiad only at first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [11:06:42] wow, https://integration.wikimedia.org/zuul/ says it will take 27 minutes [11:08:54] (03PS11) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [11:09:33] Amir1: i think thats my fauly [11:09:34] fauly [11:09:37] fauLT [11:09:45] or rather, flakey selenium tests fault [11:10:03] i had a chain of about 6 patches, and patch 2 failed because of flakely selenium, and i had to resubmit [11:10:07] php7 passed quickly but hhvm is having fun right now [11:11:06] (03PS12) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [11:15:52] (03PS13) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) [11:16:18] tested on mwdebug2002, works fine [11:16:20] moving on [11:19:01] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/FlaggedRevs/frontend/specialpages/reports/ProblemChanges_body.php: SWAT: [[gerrit:463917|Use proper index on change_tag table (T205904)]] (duration: 00m 57s) [11:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:05] T205904: Key 'change_tag_rev_tag' doesn't exist in table 'change_tag' - https://phabricator.wikimedia.org/T205904 [11:19:53] I'm done [11:20:15] jan_drewniak: around for swat? 
[11:20:22] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [11:21:02] looks like that's all for swat [11:21:16] !log EU SWAT finished [11:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:24] jouncebot: next [11:21:24] In 0 hour(s) and 38 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T1200) [11:21:30] jouncebot: update [11:21:38] jouncebot: refresh [11:21:40] I refreshed my knowledge about deployments. [11:28:46] (03PS14) 10Ema: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) (owner: 10Alex Monk) [11:29:59] (03PS15) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) [11:34:30] (03PS16) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) [11:34:45] (03PS17) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) [11:41:40] !log downtime labstore1007 load check in icinga for 1d [11:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:06] (03CR) 10Banyek: [C: 032] wikireplicas: wmf-pt-kill system user should log in to mysql [puppet] - 10https://gerrit.wikimedia.org/r/463933 (owner: 10Banyek) [11:51:20] (03PS2) 10Banyek: wikireplicas: wmf-pt-kill system user should log in to mysql [puppet] - 10https://gerrit.wikimedia.org/r/463933 [11:51:39] (03CR) 10Banyek: [V: 032 C: 032] wikireplicas: wmf-pt-kill system user should log in to mysql [puppet] - 10https://gerrit.wikimedia.org/r/463933 (owner: 10Banyek) [11:55:33] 10Operations, 
10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562 (10faidon) @Dzahn, what's the status of this? I did a cursory search and saw both existing critical hosts (like rdb1005/6 that were mentioned above) with no RAID, as well as a couple of new hos... [11:55:40] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 54.22 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:57:03] (03CR) 10Gehel: [C: 04-1] "A few cleanup to do, see comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463934 (owner: 10Mathew.onipe) [11:58:12] (03CR) 10Gehel: [C: 031] Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [11:58:42] !log converting wikidatawiki.slots to TokuDB on host dbstrore1002 (T205544) [11:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:47] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [11:58:50] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.6 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:59:41] (03CR) 10Banyek: [C: 032] user: some dotfiles for user banyek [puppet] - 10https://gerrit.wikimedia.org/r/463502 (owner: 10Banyek) [11:59:51] (03PS3) 10Banyek: user: some dotfiles for user banyek [puppet] - 10https://gerrit.wikimedia.org/r/463502 [11:59:54] (03CR) 10Banyek: [V: 032 C: 032] user: some dotfiles for user banyek [puppet] - 10https://gerrit.wikimedia.org/r/463502 (owner: 10Banyek) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T1200) [12:01:32] (03CR) 10Gehel: [C: 031] "LGTM, let's see if @volans has a last say" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:02:30] (03PS2) 10Gehel: wdqs: don't send nginx logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/463248 (https://phabricator.wikimedia.org/T200563) [12:05:49] (03PS18) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) [12:08:38] (03PS3) 10Gehel: wdqs: don't send nginx logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/463248 (https://phabricator.wikimedia.org/T200563) [12:08:40] (03PS4) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) [12:09:06] (03CR) 10Gehel: wdqs: cleanup logback configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [12:10:22] (03CR) 10Ema: [C: 032] Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) (owner: 10Alex Monk) [12:15:59] (03CR) 10jenkins-bot: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (https://phabricator.wikimedia.org/T199711) (owner: 10Alex Monk) [12:23:01] (03PS1) 10Elukey: profile::piwik: increase innodb buffer pool [puppet] - 10https://gerrit.wikimedia.org/r/463944 (https://phabricator.wikimedia.org/T202962) [12:25:43] (03PS2) 10Elukey: profile::piwik: tune mariadb configs [puppet] - 10https://gerrit.wikimedia.org/r/463944 (https://phabricator.wikimedia.org/T202962) [12:28:08] (03CR) 10Elukey: [C: 032] profile::piwik: tune mariadb configs [puppet] - 10https://gerrit.wikimedia.org/r/463944 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [12:30:21] (03CR) 10Jcrespo: [C: 04-1] "See my comment, although I would CC Moritz" (031 comment) [debs/wmf-pt-kill] - 
10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [12:32:07] (03PS1) 10Sbisson: Don't purge articlequality, draftquality scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463945 (https://phabricator.wikimedia.org/T203286) [12:32:12] 10Operations, 10Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Vgutierrez) p:05Triage>03High [12:46:24] (03CR) 10Banyek: wmf-pt-kill: WMF patched version 2 (031 comment) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [12:47:25] !log converting enwiki.contents to TokuDB on host dbstrore1002 (T205544) [12:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:29] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [12:47:50] !log converting enwiki.content to TokuDB on host dbstrore1002 (T205544) [12:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:57] (03PS2) 10Mathew.onipe: check_elasticsearch_shard_size: alert display format update [puppet] - 10https://gerrit.wikimedia.org/r/463934 [12:49:47] (03PS2) 10Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 [12:50:19] 10Operations, 10Security-team-backlog, 10monitoring: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300 (10chasemp) I had a few minutes so I looked at this because it would be super swell to have it rigged up. It's a bit complicated at the moment. Note {T12... 
[12:50:31] (03PS1) 10Elukey: matomo: remove plugin not compatible with 3.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/463947 (https://phabricator.wikimedia.org/T202962) [12:52:50] (03CR) 10Marostegui: "This looks good to me, but what about stuff like:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [12:53:37] (03CR) 10Jcrespo: [C: 04-1] "extra space" (031 comment) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [12:55:08] (03PS3) 10Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 [12:58:27] (03PS1) 10Elukey: profile::piwik::webserver: deploy php7.0-xml [puppet] - 10https://gerrit.wikimedia.org/r/463952 (https://phabricator.wikimedia.org/T202962) [13:00:05] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T1300) [13:00:44] (03CR) 10Jcrespo: "> [14:59] BTW, that post-inst should check if the user is already there before creating it" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [13:03:22] (03CR) 10Mathew.onipe: check_elasticsearch_shard_size: alert display format update (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463934 (owner: 10Mathew.onipe) [13:03:38] (03CR) 10Jcrespo: "> This looks good to me, but what about stuff like:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [13:05:59] (03CR) 10Marostegui: [C: 031] "Yep, it is, my ls was not showing it for some reason." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [13:06:56] !log Deploy schema change on s7 eqiad, this will generate lag on eqiad - T205913 [13:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:00] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [13:07:14] (03PS6) 10Vgutierrez: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [13:07:47] (03PS3) 10Mathew.onipe: check_elasticsearch_shard_size: alert display format update [puppet] - 10https://gerrit.wikimedia.org/r/463934 [13:10:26] (03PS3) 10Elukey: Replace analytics team's contacts with analytics-alerts [puppet] - 10https://gerrit.wikimedia.org/r/463306 (https://phabricator.wikimedia.org/T172532) [13:11:16] (03CR) 10Elukey: [C: 032] Replace analytics team's contacts with analytics-alerts [puppet] - 10https://gerrit.wikimedia.org/r/463306 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:12:55] 10Operations: logrotate cronspam on ms-be1040 - https://phabricator.wikimedia.org/T205974 (10fgiunchedi) [13:13:33] (03CR) 10Mobrovac: [C: 031] "PCC looking good - https://puppet-compiler.wmflabs.org/compiler1002/12717/" [puppet] - 10https://gerrit.wikimedia.org/r/458476 (owner: 10Giuseppe Lavagetto) [13:18:09] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) > Is the hope that https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/462810... 
[13:19:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10chelsyx) Hi @herron , I chatted with @mpopov and he doesn't know about it neither... So never mind :P Can... [13:19:18] (03PS7) 10Vgutierrez: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [13:19:24] (03PS2) 10Elukey: matomo: remove plugin not compatible with 3.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/463947 (https://phabricator.wikimedia.org/T202962) [13:20:06] (03CR) 10Elukey: [C: 032] matomo: remove plugin not compatible with 3.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/463947 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [13:20:30] (03PS2) 10Elukey: profile::piwik::webserver: deploy php7.0-xml [puppet] - 10https://gerrit.wikimedia.org/r/463952 (https://phabricator.wikimedia.org/T202962) [13:20:40] (03CR) 10Arturo Borrero Gonzalez: "Given GET /os-services is fine, but DELETE /os-services/id isn't fine," [puppet] - 10https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [13:20:48] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) [13:21:07] (03CR) 10Elukey: [C: 032] profile::piwik::webserver: deploy php7.0-xml [puppet] - 10https://gerrit.wikimedia.org/r/463952 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [13:22:42] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [13:23:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, 
statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [13:26:34] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) @chelsyx thanks for clarifying! I've updated the description to reflect this. Thi... [13:27:19] 10Operations, 10Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Vgutierrez) Looks like the culprit is https://github.com/wikimedia/puppet/blob/production/modules/install_server/files/autoinstall/netboot.cfg#L126-L127: ``` lvs100[7-9]|lvs101[012]|lvs2*) ec... [13:27:35] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10Ottomata) This doesn't really escalate any access to data, as Chelsey already has analytics... 
[13:28:20] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10elukey) Adding @Nuria to approve from our side :) [13:29:31] (03CR) 10Ottomata: [C: 031] Only allow HTTP port for Hue [puppet] - 10https://gerrit.wikimedia.org/r/463428 (owner: 10Muehlenhoff) [13:30:54] (03PS1) 10Elukey: role::cache::text: add matomo1001 backend for piwik.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/463957 (https://phabricator.wikimedia.org/T202962) [13:33:32] (03PS3) 10Herron: admin: create new group wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/463835 (https://phabricator.wikimedia.org/T205543) [13:38:58] (03CR) 10Herron: [C: 032] admin: create new group wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/463835 (https://phabricator.wikimedia.org/T205543) (owner: 10Herron) [13:40:47] (03PS5) 10Gehel: Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [13:41:43] (03PS2) 10Herron: wdqs: add wdqs-roots group to wdqs common role [puppet] - 10https://gerrit.wikimedia.org/r/463837 (https://phabricator.wikimedia.org/T205543) [13:42:18] (03CR) 10Herron: [C: 032] wdqs: add wdqs-roots group to wdqs common role [puppet] - 10https://gerrit.wikimedia.org/r/463837 (https://phabricator.wikimedia.org/T205543) (owner: 10Herron) [13:42:38] (03CR) 10Gehel: [C: 032] Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [13:43:00] (03PS6) 10Gehel: Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [13:43:23] herron: thanks for the merge! 
cc onimisionipe [13:43:55] (03PS2) 10Herron: admin: add onimisionipe to group wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/463836 (https://phabricator.wikimedia.org/T205543) [13:44:02] gehel: you bet! [13:44:42] (03CR) 10Herron: [C: 032] admin: add onimisionipe to group wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/463836 (https://phabricator.wikimedia.org/T205543) (owner: 10Herron) [13:46:58] !log Deploy schema change on s4 eqiad, this will generate lag on eqiad - T205913 [13:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:03] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [13:52:36] (03PS1) 10Gehel: maps: change date of job to regenerate high level tiles [puppet] - 10https://gerrit.wikimedia.org/r/463962 (https://phabricator.wikimedia.org/T202201) [13:53:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Mathew full root on wdqs servers - https://phabricator.wikimedia.org/T205543 (10herron) 05Open>03Resolved a:03herron @Mathew.onipe you should now have root on the wdqs systems. From an example wdqs system: ``` herron@wdqs1004:~$ getent pas... [13:53:18] (03CR) 10MSantos: [C: 031] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/463962 (https://phabricator.wikimedia.org/T202201) (owner: 10Gehel) [13:53:32] (03CR) 10Gehel: [C: 032] maps: change date of job to regenerate high level tiles [puppet] - 10https://gerrit.wikimedia.org/r/463962 (https://phabricator.wikimedia.org/T202201) (owner: 10Gehel) [13:54:10] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10herron) a:03RobH [13:56:12] (03PS4) 10Gehel: check_elasticsearch_shard_size: alert display format update [puppet] - 10https://gerrit.wikimedia.org/r/463934 (owner: 10Mathew.onipe) [13:56:27] RECOVERY - High load average on labstore1007 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [13:56:59] (03CR) 10Gehel: [C: 032] check_elasticsearch_shard_size: alert display format update [puppet] - 10https://gerrit.wikimedia.org/r/463934 (owner: 10Mathew.onipe) [14:00:28] (03CR) 10Ema: "There's ./hieradata/role/common/trafficserver/backend.yaml too :)" [puppet] - 10https://gerrit.wikimedia.org/r/463957 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [14:01:57] (03CR) 10Elukey: "> There's ./hieradata/role/common/trafficserver/backend.yaml too :)" [puppet] - 10https://gerrit.wikimedia.org/r/463957 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [14:02:34] (03PS2) 10Elukey: role::cache::text: add matomo1001 backend for piwik.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/463957 (https://phabricator.wikimedia.org/T202962) [14:06:17] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) The temporary solution seems not enough, probably because of {T205735}, when the populate_admin cron starts... 
[14:07:02] 10Operations: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 (10Joe) More detailed plan: [] Replicate from the codfw etcd cluster to the new etcd cluster in eqiad. This means setting up an instance of `profile::etcd::replication` as active on one of the ne... [14:09:23] 10Operations, 10Core Platform Team Kanban (Watching / External), 10Epic, 10Performance-Team (Radar), 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213 (10CCicalese_WMF) [14:09:29] 10Operations, 10Core Platform Team Kanban (Watching / External), 10Epic, 10Performance-Team (Radar), 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206 (10CCicalese_WMF) [14:11:46] !log powering off dbstore2002.codfw.wmnet for BBU change (T205257) [14:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:51] T205257: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 [14:15:24] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10RobH) If the person is in the SRE team, add them to the #acl*sre-team. If they are just staff and need access, add to #acl-procurement-review. As @Mathew.onipe is part of SR... 
[14:15:28] (03PS2) 10Gehel: Kartotherian: Add wikidata_query_service var for configuring WDQS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/463281 (https://phabricator.wikimedia.org/T205607) (owner: 10Mholloway) [14:15:32] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10RobH) 05Open>03Resolved [14:16:17] (03CR) 10Gehel: [C: 032] "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/12718/" [puppet] - 10https://gerrit.wikimedia.org/r/463281 (https://phabricator.wikimedia.org/T205607) (owner: 10Mholloway) [14:17:45] (03PS3) 10Gehel: Switch public cluster to Kafka event source [puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) (owner: 10Mathew.onipe) [14:18:20] (03CR) 10Gehel: [C: 032] Switch public cluster to Kafka event source [puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) (owner: 10Mathew.onipe) [14:22:25] (03CR) 10Ottomata: [C: 031] profile::statistics::cruncher|private: remove unused bacula settings [puppet] - 10https://gerrit.wikimedia.org/r/454480 (https://phabricator.wikimedia.org/T201165) (owner: 10Elukey) [14:23:27] SMalyshev: ^^ wdqs public cluster is not on kafka, logs are looking good [14:25:57] ACKNOWLEDGEMENT - HP RAID on db2058 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T205980 [14:26:04] 10Operations, 10ops-codfw: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T205980 (10ops-monitoring-bot) [14:26:05] (03CR) 10Andrew Bogott: [C: 04-1] "> Do you know if we can do some fine-grain policy filtering here?" 
[puppet] - 10https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [14:27:27] 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement complete. [14:28:17] 10Operations, 10ops-codfw: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T205980 (10Marostegui) [14:28:22] 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) [14:29:36] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) >>! In T41785#4625537, @herron wrote: >Afaict we'll use the `mx-outNN.wmflabs.org` records in cloud/labs mail client configs and will need to... [14:29:43] 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) Thanks - I see it rebuilding: ``` rroot@db2058:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DC560) Port Name: 1I Port Na... 
[14:30:21] (03CR) 10Ottomata: Introduce cumin::selector dummy class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [14:30:32] (03PS7) 10Ottomata: Introduce cumin::selector dummy class [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [14:38:51] (03CR) 10Ottomata: "Noop other than adding dummy class: https://puppet-compiler.wmflabs.org/compiler1002/12719/stat1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [14:38:56] (03CR) 10Ottomata: [C: 032] Introduce cumin::selector dummy class [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [14:39:21] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work): Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10Mathew.onipe) [14:41:27] (03PS1) 10Ottomata: Use cumin::selector instead of profile::cumin::target in get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/463966 (https://phabricator.wikimedia.org/T204088) [14:42:01] cmjohnson1: o/ [14:44:05] PROBLEM - puppet last run on ms-be2042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:35] PROBLEM - puppet last run on analytics1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:05] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:49:58] (03PS1) 10Mathew.onipe: admin: add Matt(onimisionipe) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/463967 (https://phabricator.wikimedia.org/T205981) [14:51:06] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10Nuria) Approved. [14:51:19] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) [15:02:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Drop python 2.7 support [software/conftool] - 10https://gerrit.wikimedia.org/r/443394 (owner: 10Giuseppe Lavagetto) [15:04:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Sanitize class names for entities [software/conftool] - 10https://gerrit.wikimedia.org/r/442899 (owner: 10Giuseppe Lavagetto) [15:05:06] (03Merged) 10jenkins-bot: Sanitize class names for entities [software/conftool] - 10https://gerrit.wikimedia.org/r/442899 (owner: 10Giuseppe Lavagetto) [15:05:55] <_joe_> win 21 [15:10:55] RECOVERY - puppet last run on analytics1058 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:11:25] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:14:25] RECOVERY - puppet last run on ms-be2042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:19:06] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10faidon) I'm not familiar with WMCS' networking -- does this floating IP DNAT imply SNAT as well? i.e. would it not be possible to have flows of the fo... 
[15:21:27] !log akosiaris@deploy1001 scap-helm mathoid upgrade -h [namespace: mathoid, clusters: eqiad,codfw] [15:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:31] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [15:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:32] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [15:21:33] !log akosiaris@deploy1001 scap-helm mathoid finished [15:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:51] dammit this need some fixes [15:23:08] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [15:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:20] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [15:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:38] (03PS6) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) [15:23:53] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [15:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:06] !log akosiaris@deploy1001 scap-helm mathoid finished [15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:40] !log upgrade mathoid chart version to 0.0.11 [15:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:51] this should be a noop [15:26:48] famous last words [15:27:55] yeah more or less [15:28:16] looks fine though :-) [15:28:26] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet 
operation_type={create_container,remove_container,start_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:28:33] that's expected [15:28:41] I should bump the threshold [15:29:15] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:29:33] wow 4 secs ? [15:29:35] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,remove_container,run_podsandbox,start_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:29:35] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={create_container,remove_container,run_podsandbox,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:30:06] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:30:06] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,podsandbox_status,start_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:30:16] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:30:29] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) In the current deployment there's a thing called labsaliaser which causes the DNS recursors to substitute responses containing labs public fl... [15:30:35] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:30:36] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:30:36] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:31:06] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:31:15] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:31:57] ah that 6 secs is the sum, ok no worries about that [15:32:06] but the per node threshold probably need to be bumped [15:33:45] !log cleanup old cronjob (cleanup GC logs) on all elasticsearch servers [15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:57] elukey: ^^ thanks for the ping! [15:34:19] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12720/mw1261.eqiad.wmnet/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:35:01] gehel: nice! 
[15:39:11] (03CR) 10Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [15:42:57] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10Gehel) I can confirm that @Mathew.onipe needs to be able to deploy wikidata query service. t... [15:48:00] (03PS1) 10Dzahn: admins: disable account for Maximilian Pany [puppet] - 10https://gerrit.wikimedia.org/r/463970 [15:50:23] !log cutting 1.32.0-wmf.24 branch [15:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:46] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) >>! In T41785#4634182, @Krenair wrote: > In the current 'main' deployment there's a thing called labsaliaser which causes the DNS recursors to... [15:51:15] (03PS4) 10Paladox: Gerrit: Add flogger javaopts [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) [15:54:13] (03CR) 10Elukey: [C: 031] mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:57:35] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Dzahn) @Papaul please see T205970 [15:58:16] 10Operations, 10Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Dzahn) i would say this is part of T196560 which is still open. i left a comment there. [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:18] (03CR) 10Imarlier: profile::mediawiki::php: add support for php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [16:01:47] PROBLEM - Host cloudnet1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:17] RECOVERY - Host cloudnet1004 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [16:03:21] ?? [16:03:45] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@093551f]: Increase cirrusSearchLinksUpdate concurrency [16:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:57] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10herron) [16:04:51] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@093551f]: Increase cirrusSearchLinksUpdate concurrency (duration: 01m 06s) [16:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:05] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562 (10Dzahn) Regarding rdb1005/1006 i have pinged in May, August and September on T140442#4186806 because to me it was the last one keeping this ticket open and i would like to finally close it.... 
[16:07:47] 10Operations, 10Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Dzahn) [16:07:52] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Dzahn) [16:09:48] (03PS5) 10Alex Monk: [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [16:10:13] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [16:11:34] (03CR) 10Alex Monk: [C: 04-1] "package and probably service were renamed in some of the latest PSes on the packaging commit" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:12:13] (03PS1) 10Cwhite: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) [16:12:52] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [16:16:00] 10Operations, 10netops: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985 (10ayounsi) p:05Triage>03High [16:16:44] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10herron) Let's go forward with this. The workboard is enabled, and a few columns have been created (acknowledged and radar). Backlog is still "backlog" for the time being since all opera... 
[16:19:10] (03PS1) 10Ayounsi: Assign IPs for ulsfo-office interco [dns] - 10https://gerrit.wikimedia.org/r/463977 (https://phabricator.wikimedia.org/T205985) [16:19:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 58.55 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:22:19] 10Operations, 10netops, 10Patch-For-Review: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985 (10ayounsi) [16:23:30] (03CR) 10Ayounsi: [C: 032] Assign IPs for ulsfo-office interco [dns] - 10https://gerrit.wikimedia.org/r/463977 (https://phabricator.wikimedia.org/T205985) (owner: 10Ayounsi) [16:24:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.3 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:26:29] (03CR) 10Cwhite: "Only class parameter changes on einsteinium/tegmen. Expected changes on icinga1001. https://puppet-compiler.wmflabs.org/compiler1002/127" [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [16:29:25] (03CR) 10Arturo Borrero Gonzalez: "> > Do you know if we can do some fine-grain policy filtering here?" [puppet] - 10https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [16:29:39] 10Operations, 10MediaWiki-Shell, 10Wikimedia-General-or-Unknown, 10Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (10BPirkle) [16:31:23] (03CR) 10Krinkle: [C: 031] "LGTM. Just as before, it's fine for the warmup to happen ahead of the switch. 
Although not more than 2 days imho, given stuff will start t" [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:33:49] (03PS1) 10Cwhite: icinga, nagios_common: use commands.cfg on stretch [puppet] - 10https://gerrit.wikimedia.org/r/463981 (https://phabricator.wikimedia.org/T202782) [16:38:42] !log swapping failed disk db1067 T205780 [16:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:46] T205780: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 [16:38:48] marostegui ^ [16:40:20] (03CR) 10Andrew Bogott: [C: 04-1] "> This script is only supposed to run in cloudcontrol boxes" [puppet] - 10https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [16:41:54] 10Operations, 10ops-eqiad, 10DBA: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10Cmjohnson) The disk has been swapped [16:42:43] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Cmjohnson) Waiting on the part still [16:43:16] (03PS1) 10Ayounsi: Reserve new frack-bastion-codfw subnet [dns] - 10https://gerrit.wikimedia.org/r/463983 (https://phabricator.wikimedia.org/T204271) [16:44:16] 10Operations, 10Traffic: Simplify comment misc-frontend.inc.vcl.erb - https://phabricator.wikimedia.org/T205988 (10Imarlier) [16:44:27] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) Updated HP that the status remains the same and that the 3rd battery they sent us still does not fix the problem. 
[16:44:27] (03CR) 10Imarlier: sitemaps: Generalize varnish rule for sitemaps, to apply to all domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) (owner: 10Imarlier) [16:44:45] (03CR) 10Imarlier: sitemaps: Generalize varnish rule for sitemaps, to apply to all domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) (owner: 10Imarlier) [16:46:09] (03CR) 10Ayounsi: [C: 032] Reserve new frack-bastion-codfw subnet [dns] - 10https://gerrit.wikimedia.org/r/463983 (https://phabricator.wikimedia.org/T204271) (owner: 10Ayounsi) [16:46:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Cmjohnson) @mortzm Is it safe to say that this can be resolved? Thanks! Chris [16:49:06] !log assign 10.195.0.129/29 to pfw3-codfw:reth0.2133 - T204271 [16:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:10] T204271: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 [16:50:01] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) [16:53:03] (03PS1) 10Ayounsi: Remove old frack-bastion-codfw grow frack-administration-codfw to /28 [dns] - 10https://gerrit.wikimedia.org/r/463990 (https://phabricator.wikimedia.org/T204271) [16:53:31] !log setup test s3 replication channel on db1110 (filtered) [16:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:33] (03CR) 10Krinkle: [C: 04-1] Beta: enable MobileFrontend and move some config in to labs settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [16:56:27] (03CR) 10Krinkle: [C: 031] Beta: enable MobileFrontend and move some config in to labs 
settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [16:58:45] (03PS5) 10Imarlier: Beta: enable MobileFrontend and move some config in to labs settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) [16:58:59] (03CR) 10Imarlier: Beta: enable MobileFrontend and move some config in to labs settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T1700). [17:01:43] 10Operations, 10ops-eqiad, 10DBA: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10jcrespo) Thank you! Our alerting detects the rebuild as a down so I had worried at first without context :-) Will close when I can assure the rebuild completed successfully. 
[17:02:35] PROBLEM - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:02:36] ACKNOWLEDGEMENT - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T205990 [17:02:41] 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T205990 (10ops-monitoring-bot) [17:02:57] (03PS2) 10Jdlrobson: smaller wiki Minerva a/b tests are bumped to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463875 (https://phabricator.wikimedia.org/T200792) [17:03:53] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) The full import process took around 7-8 hours (although it was quite inefficient to re-import just one database at a time, as the last minutes ar... [17:04:39] 10Operations, 10ops-eqiad, 10DBA: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10jcrespo) [17:04:45] 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T205990 (10jcrespo) [17:05:45] 10Operations, 10Mail, 10User-herron: Mail relays needed for VMs in eqiad1 - https://phabricator.wikimedia.org/T205158 (10herron) [17:08:36] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [17:11:18] (03PS1) 10Urbanecm: [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463996 [17:11:36] (03PS2) 10Jdlrobson: Page issues A/B test to 20% of users (Start the a/b test!) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463807 (https://phabricator.wikimedia.org/T200792) [17:11:43] (03Abandoned) 10Jdlrobson: Page issues A/B test to 20% of users (Start the a/b test!) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463807 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [17:12:15] 10Operations, 10Cloud-Services, 10Parsing-Team, 10Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (10Krinkle) >>! In T163438#4633123... [17:12:34] (03CR) 10Jdlrobson: Remove dead config relating to wgRelatedArticlesEnabledBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462573 (https://phabricator.wikimedia.org/T202306) (owner: 10Jdlrobson) [17:13:25] RECOVERY - Device not healthy -SMART- on db1067 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1067&var-datasource=eqiad%2520prometheus%252Fops [17:14:03] 10Operations, 10Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Papaul) @Vgutierrez does this mean reinstalling both LVS servers (lvs2009 and lvs2010) if yes please elaborate how you want to approach this . Thanks. [17:16:38] (03CR) 10Gehel: wdqs: cleanup logback configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [17:16:43] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) Command executed, for posterity: ``` root@db1110.eqiad.wmnet[(none)]> CHANGE MASTER 's3' TO ... root@db1110.eqiad.wmne... 
[17:18:35] !log upgrading debian packages and MediaWiki version on wikitech-static [17:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:36] (03PS2) 10Urbanecm: [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463996 (https://phabricator.wikimedia.org/T205995) [17:21:33] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10herron) p:05Triage>03Normal [17:21:44] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) a:03Cmjohnson [17:22:23] !log upgraded wikitech-static to remotes/origin/REL1_31 [17:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:37] 10Operations, 10Wikimedia-Mailing-lists, 10User-Urbanecm: Non-working archive for wikimediacz-l list - https://phabricator.wikimedia.org/T205380 (10herron) p:05Triage>03Normal [17:22:40] !log update fw policies on pfw3-codfw - T204271 [17:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:43] T204271: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 [17:23:01] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10herron) p:05Triage>03High [17:23:40] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Scap, and 2 others: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10herron) p:05Triage>03Normal [17:23:41] (03CR) 10Catrope: [C: 031] Don't purge articlequality, draftquality scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463945 (https://phabricator.wikimedia.org/T203286) (owner: 10Sbisson) [17:24:05] 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (10herron) 
p:05Triage>03Normal [17:24:35] PROBLEM - puppet last run on kubetcd2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:24:40] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs4001.ulsfo.wmnet and performed the following actions: - Revoked Puppet certif... [17:24:53] 10Operations, 10Traffic: Simplify comment misc-frontend.inc.vcl.erb - https://phabricator.wikimedia.org/T205988 (10herron) p:05Triage>03Normal [17:24:54] (03CR) 10Imarlier: [C: 032] Beta: enable MobileFrontend and move some config in to labs settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [17:24:56] PROBLEM - Wikitech-static main page has content on labtestweb2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.099 second response time [17:24:57] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.105 second response time [17:24:58] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs4002.ulsfo.wmnet and performed the following actions: - Revoked Puppet certif... 
[17:25:05] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.100 second response time [17:25:07] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs4003.ulsfo.wmnet and performed the following actions: - Revoked Puppet certif... [17:25:16] 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10herron) p:05Triage>03High [17:25:19] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs4004.ulsfo.wmnet and performed the following actions: - Revoked Puppet certif... 
[17:25:28] !log update fw policies on pfw3-eqiad - T204271 [17:25:29] 10Operations, 10Wikimedia-Logstash: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) p:05Triage>03Normal [17:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:51] 10Operations, 10monitoring: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10herron) p:05Triage>03Normal [17:26:10] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) p:05Triage>03Normal [17:26:24] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) [17:26:29] 10Operations, 10Wikimedia-Logstash: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10herron) p:05Triage>03Normal [17:26:41] 10Operations, 10Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (10herron) p:05Triage>03Normal [17:26:56] (03Merged) 10jenkins-bot: Beta: enable MobileFrontend and move some config in to labs settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [17:27:02] 10Operations, 10netops: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (10herron) p:05Triage>03High [17:27:34] 10Operations, 10ops-codfw: wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10herron) p:05Triage>03High [17:27:47] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10RobH) [17:27:59] 10Operations, 10Wikimedia-Logstash: 
Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10herron) p:05Triage>03Normal [17:28:16] 10Operations, 10Wikimedia-Mailing-lists: Request new mail list for Vietnam Wikimedians User Group - https://phabricator.wikimedia.org/T204974 (10herron) p:05Triage>03Normal [17:28:20] 10Operations, 10Cloud-Services, 10Parsing-Team, 10Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (10Arlolra) That patch hasn't been... [17:28:34] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10RobH) So, these are all racked in the two new racks, but without any power or network. As such, I'll just continue with the remainder of the steps (puppet was never dis... [17:28:41] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10herron) p:05Triage>03Normal [17:30:14] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10ovasileva) [17:30:28] 10Operations, 10Puppet: Why doesn't profile::mediawiki::nutcracker create /var/run/nutcracker/ ? 
- https://phabricator.wikimedia.org/T204450 (10herron) p:05Triage>03Normal [17:30:37] (03PS1) 10RobH: decom lvs400[1-4].ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/464005 (https://phabricator.wikimedia.org/T178535) [17:30:52] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10ovasileva) [17:31:09] (03CR) 10RobH: [C: 032] decom lvs400[1-4].ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/464005 (https://phabricator.wikimedia.org/T178535) (owner: 10RobH) [17:32:37] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10JKatzWMF) Approved. Thanks! [17:33:24] (03CR) 10jenkins-bot: Beta: enable MobileFrontend and move some config in to labs settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [17:34:08] (03PS1) 10RobH: decom lvs400[1-4] dns entries [dns] - 10https://gerrit.wikimedia.org/r/464006 (https://phabricator.wikimedia.org/T178535) [17:34:35] (03CR) 10RobH: [C: 032] decom lvs400[1-4] dns entries [dns] - 10https://gerrit.wikimedia.org/r/464006 (https://phabricator.wikimedia.org/T178535) (owner: 10RobH) [17:35:50] (03PS2) 10Dzahn: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:36:26] 10Operations, 10ops-ulsfo, 10decommission: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10RobH) [17:36:36] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:36:52] 10Operations, 10ops-ulsfo, 10decommission: 
decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10RobH) [17:37:52] !log update NAT for frbast2001 on pfw3-codfw - T204271 [17:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:56] T204271: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 [17:37:59] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10MBinder_WMF) [17:38:42] (03CR) 10Dzahn: "the comment from the style check seems to make no sense:" [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:40:06] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) [17:46:32] (03PS4) 10Gehel: wdqs: don't send nginx logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/463248 (https://phabricator.wikimedia.org/T200563) [17:46:59] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Cmjohnson) Swapped the failed disk [17:47:24] (03CR) 10Gehel: [C: 032] wdqs: don't send nginx logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/463248 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [17:49:40] RECOVERY - puppet last run on kubetcd2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:50:56] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) So i get the same error just attempting to boot into the BIOS. ``` iDRAC Settings: CBL0009: Backplane 1 connector A0 is not connected. CBL0009: Backplane... 
[17:51:13] !log arlolra@deploy1001 Started deploy [parsoid/deploy@19053a3]: Updating Parsoid to 65d6f82 [17:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:11] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [18:01:27] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10Cmjohnson) THe disks are now being seen by the contorller, this server was the spare we borrowed a cable from to work on cloudvirt1023. Re-connected the cable and n... [18:01:57] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@19053a3]: Updating Parsoid to 65d6f82 (duration: 10m 44s) [18:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:10] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) a:05Cmjohnson>03elukey [18:03:23] (03PS1) 10Mforns: Add backup parameter to saltrotate cron job [puppet] - 10https://gerrit.wikimedia.org/r/464009 (https://phabricator.wikimedia.org/T199900) [18:04:31] (03CR) 10Elukey: [C: 032] Add backup parameter to saltrotate cron job [puppet] - 10https://gerrit.wikimedia.org/r/464009 (https://phabricator.wikimedia.org/T199900) (owner: 10Mforns) [18:04:43] (03PS1) 10ArielGlenn: move multiversion config setting to 'wiki' section for incr dumps [dumps] - 10https://gerrit.wikimedia.org/r/464010 [18:05:54] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2018), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) [18:06:06] (03CR) 10ArielGlenn: [C: 032] move multiversion config setting to 'wiki' section for incr dumps [dumps] - 10https://gerrit.wikimedia.org/r/464010 (owner: 10ArielGlenn) [18:07:20] !log 
ariel@deploy1001 Started deploy [dumps/dumps@a9570fb]: fix incr dumps multiversion conf setting [18:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:26] !log ariel@deploy1001 Finished deploy [dumps/dumps@a9570fb]: fix incr dumps multiversion conf setting (duration: 00m 06s) [18:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:50] (03PS3) 10Cwhite: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) [18:07:58] !log Updated Parsoid to 65d6f82 (T163438, T205674, T205673) [18:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:05] T163438: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 [18:08:06] T205673: Template range without arginfo. - https://phabricator.wikimedia.org/T205673 [18:08:07] T205674: Flipped range should have been enclosed. - https://phabricator.wikimedia.org/T205674 [18:08:10] RECOVERY - Device not healthy -SMART- on helium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=helium&var-datasource=eqiad%2520prometheus%252Fops [18:08:42] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:09:11] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [18:09:28] (03CR) 10BBlack: [C: 04-1] sitemaps: Generalize varnish rule for sitemaps, to apply to all domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) (owner: 10Imarlier) [18:09:33] those wikitech-static alerts are me [18:10:48] 10Operations, 10Cloud-Services, 10Parsing-Team, 10Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (10Arlolra) 05Open>03Resolved [18:12:11] 10Operations, 10Traffic: Simplify comment misc-frontend.inc.vcl.erb - https://phabricator.wikimedia.org/T205988 (10BBlack) Probably we need to do more than simplify the comment here, and instead actually fix/refactor the logic so it can work sanely. Either way, we'll need some relatively-bulletproof way to li... 
[18:12:29] PROBLEM - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) [18:12:34] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206004 [18:12:38] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T206004 (10ops-monitoring-bot) [18:14:59] RECOVERY - Wikitech-static main page has content on labtestweb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 33279 bytes in 0.425 second response time [18:15:00] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33279 bytes in 0.281 second response time [18:15:01] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 33278 bytes in 0.223 second response time [18:15:08] (03CR) 10Dzahn: icinga: add new icinga.cfg and move configuration (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:18:58] (03CR) 10Dzahn: [C: 031] icinga, nagios_common: use commands.cfg on stretch [puppet] - 10https://gerrit.wikimedia.org/r/463981 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:21:49] (03PS1) 10Andrew Bogott: Settings.php: use wfLoadExtension for a few more extensions [wikitech-static] - 10https://gerrit.wikimedia.org/r/464012 [18:24:54] !log restarting ferm on dbstore2002 T205257 [18:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:58] T205257: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 [18:26:20] (03PS3) 10Ottomata: Add chelsyx to analytics-search-users group [puppet] - 10https://gerrit.wikimedia.org/r/463517 (https://phabricator.wikimedia.org/T205736) (owner: 10Bearloga) [18:26:30] !log remove old 10.195.0.65/29 from pfw3-codfw - T204271 [18:26:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:34] T204271: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 [18:26:43] (03PS4) 10Cwhite: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) [18:26:45] (03CR) 10Ottomata: [V: 032 C: 032] Add chelsyx to analytics-search-users group [puppet] - 10https://gerrit.wikimedia.org/r/463517 (https://phabricator.wikimedia.org/T205736) (owner: 10Bearloga) [18:27:53] (03PS5) 10Cwhite: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) [18:28:55] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:32:12] (03CR) 10Imarlier: sitemaps: Generalize varnish rule for sitemaps, to apply to all domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) (owner: 10Imarlier) [18:33:10] RECOVERY - MegaRAID on db1067 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy [18:39:15] !log replace 10.195.0.73/29 with 10.195.0.65/28 on pfw3-codfw - T204271 [18:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:20] T204271: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 [18:42:10] 10Operations, 10ops-eqiad, 10DBA: db1067 (enwiki master) disk #7 with errors - https://phabricator.wikimedia.org/T205780 (10Marostegui) 05Open>03Resolved The alert cleared and the RAID is back to optimal! ``` root@db1067:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0... 
[18:42:16] (03PS6) 10Cwhite: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) [18:42:59] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:43:18] 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) 05Open>03Resolved All good! Thank you! ``` root@db2058:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DC560) Port Name: 1I... [18:44:06] (03PS7) 10Cwhite: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) [18:44:47] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:45:11] (03CR) 10Dzahn: [C: 031] "i did an actual comparison one more like this:" [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:45:58] (03CR) 10Ayounsi: [C: 032] Remove old frack-bastion-codfw grow frack-administration-codfw to /28 [dns] - 10https://gerrit.wikimedia.org/r/463990 (https://phabricator.wikimedia.org/T204271) (owner: 10Ayounsi) [18:46:05] (03PS2) 10Ayounsi: Remove old frack-bastion-codfw grow frack-administration-codfw to /28 [dns] - 10https://gerrit.wikimedia.org/r/463990 (https://phabricator.wikimedia.org/T204271) [18:46:39] (03PS2) 10Andrew Bogott: Settings.php: use wfLoadExtension for a few more extensions [wikitech-static] - 10https://gerrit.wikimedia.org/r/464012 [18:46:41] (03PS1) 10Andrew Bogott: import-wikitech.sh: run without --uploads to make sure we get new pages [wikitech-static] - 
10https://gerrit.wikimedia.org/r/464014 (https://phabricator.wikimedia.org/T204840) [18:48:09] (03PS8) 10Cwhite: icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) [18:48:46] (03CR) 10jerkins-bot: [V: 04-1] icinga: add new icinga.cfg and move configuration [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:49:03] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) [18:49:45] (03CR) 10Cwhite: [V: 032 C: 032] icinga: add new icinga.cfg and move configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:49:50] (03CR) 10Andrew Bogott: [V: 032 C: 032] Settings.php: use wfLoadExtension for a few more extensions [wikitech-static] - 10https://gerrit.wikimedia.org/r/464012 (owner: 10Andrew Bogott) [18:49:56] (03CR) 10Dzahn: "i think we should merge it despite what jenkins-bot says. it looks like a bug to me. can't be "0 found" but also "delta 3" at the same tim" [puppet] - 10https://gerrit.wikimedia.org/r/463975 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:50:06] (03CR) 10Andrew Bogott: [V: 032 C: 032] import-wikitech.sh: run without --uploads to make sure we get new pages [wikitech-static] - 10https://gerrit.wikimedia.org/r/464014 (https://phabricator.wikimedia.org/T204840) (owner: 10Andrew Bogott) [18:51:22] (03PS2) 10Cwhite: icinga, nagios_common: use commands.cfg on stretch [puppet] - 10https://gerrit.wikimedia.org/r/463981 (https://phabricator.wikimedia.org/T202782) [19:00:05] marxarelli: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T1900). 
[19:08:53] (03PS2) 10Dzahn: admins: disable account for Maximilian Pany [puppet] - 10https://gerrit.wikimedia.org/r/463970 [19:09:28] (03CR) 10Dzahn: [C: 032] admins: disable account for Maximilian Pany [puppet] - 10https://gerrit.wikimedia.org/r/463970 (owner: 10Dzahn) [19:10:49] (03PS3) 10Dzahn: icinga, nagios_common: use commands.cfg on stretch [puppet] - 10https://gerrit.wikimedia.org/r/463981 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [19:11:21] (03CR) 10Dzahn: [C: 032] "bast1002: Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/mpany]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/463970 (owner: 10Dzahn) [19:11:30] (03CR) 10Dzahn: [C: 032] icinga, nagios_common: use commands.cfg on stretch [puppet] - 10https://gerrit.wikimedia.org/r/463981 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [19:14:10] (03CR) 10Cwhite: "Changes expected: https://puppet-compiler.wmflabs.org/compiler1002/12722/" [puppet] - 10https://gerrit.wikimedia.org/r/463981 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [19:19:54] !log update fw policies on pfw3-codfw - T204271 [19:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:59] T204271: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 [19:21:24] !log update fw policies on pfw3-eqiad - T204271 [19:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:09] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:26:40] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:27:19] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:28:19] (03PS1) 10Dduvall: group0 to 1.32.0-wmf.24 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/464019 [19:29:29] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:33:05] (03CR) 10Dzahn: "i understand it's needed in 2.16 but still not clear whether it's ok to be merged _before_ 2.16 or will cause issues unless it waits until" [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [19:33:40] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:34:10] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:35:34] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10chelsyx) Thanks everyone! The patch above gave me access to analytics-search-users. Can I b... 
[19:36:37] !log dduvall@deploy1001 Pruned MediaWiki: 1.32.0-wmf.19 (duration: 07m 25s) [19:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:22] (03CR) 10Dduvall: [C: 032] group0 to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464019 (owner: 10Dduvall) [19:42:47] !log dduvall@deploy1001 Started scap: group0 to php-1.32.0-wmf.24 [19:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:03] (03Merged) 10jenkins-bot: group0 to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464019 (owner: 10Dduvall) [19:45:59] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:45:59] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:46:59] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [19:46:59] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [19:47:53] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) 05Open>03Resolved a:03Dzahn Thanks Chris! The Icinga alert is green again: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=helium&service=Device+not+healthy... 
[19:48:08] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) a:05Dzahn>03Cmjohnson [19:49:01] (03PS1) 10Smalyshev: Fix WDQS service name [puppet] - 10https://gerrit.wikimedia.org/r/464020 [19:49:32] (03CR) 10jenkins-bot: group0 to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464019 (owner: 10Dduvall) [19:50:46] !log update prefix-list fundraising-codfw-internal4 to /24 on pfw3-codfw - T204271 [19:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:50] T204271: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 [19:52:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Dzahn) >>! In T196886#4577448, @akosiaris wrote: > https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wtp1043&service=MD+RAID still complains btw. It's green again now... [19:53:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Dzahn) 05Open>03Resolved a:05MoritzMuehlenhoff>03None ``` [wtp1043:~] $ sudo smartctl -H /dev/sda smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build) Copyri... [19:56:16] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T206004 (10Dzahn) uhm... 
also see T205364 [19:56:34] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) also T206004 [19:59:11] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (3064 200000s) [20:02:28] (03PS1) 10Jgreen: update frbast2001.frack.codfw.wmnet IP address to .130 [dns] - 10https://gerrit.wikimedia.org/r/464030 [20:03:47] (03CR) 10Jgreen: [C: 032] update frbast2001.frack.codfw.wmnet IP address to .130 [dns] - 10https://gerrit.wikimedia.org/r/464030 (owner: 10Jgreen) [20:04:31] !log authdns-update to deploy new IP for frbast2001.frack.eqiad.wmnet [20:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:39] PROBLEM - High CPU load on API appserver on mw2138 is CRITICAL: CRITICAL - load average: 65.81, 30.92, 19.54 [20:08:19] PROBLEM - High CPU load on API appserver on mw2145 is CRITICAL: CRITICAL - load average: 61.71, 37.30, 23.98 [20:09:11] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (3506 200000s) [20:09:49] RECOVERY - High CPU load on API appserver on mw2138 is OK: OK - load average: 22.87, 29.04, 20.62 [20:10:30] RECOVERY - High CPU load on API appserver on mw2145 is OK: OK - load average: 20.90, 29.47, 22.86 [20:15:47] !log dduvall@deploy1001 Finished scap: group0 to php-1.32.0-wmf.24 (duration: 33m 00s) [20:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:30] (03PS4) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [20:42:18] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) 
[20:43:25] (03PS5) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [20:44:32] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [20:49:29] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 58.35 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:53:49] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 75.2 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:01:50] !log labstore2003 stopped service block_sync [21:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:13] !log re-enabling and starting backups on host es2001 (TT205257) [21:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:44] !log enabling puppet on es2001 [21:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:09] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: OpenConfirm https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:23:09] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 405, down: 6, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:23:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 27 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:28:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 317 (alerts on 25) - 
https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:39:53] !log Fix unused vlans XLink1/2 on asw2-a5 [21:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:53] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) [21:43:35] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) 05Open>03Resolved An oversight prevented frbast2001 to reach eqiad: codfw only advertised 10.195.0.0/25 to eqiad over ipsec. Making it a /24... [21:44:14] 10Operations, 10Cloud-Services, 10Developer-Advocacy, 10LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (10Andrew) [21:50:43] 10Operations, 10Citoid, 10Services, 10Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10thcipriani) Added test `.pipeline` files for zotero. Now running through CI and tests are passing: https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/... [21:58:34] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Aklapper) >>! In T197624#4634347, @herron wrote: > @Aklapper how would you suggest transitioning the ~1400 existing tasks currently in "backlog" on the workboard to "acknowledged" without... [22:09:38] 10Operations, 10Analytics, 10Traffic, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Pchelolo) We did enable the feature after all by looking at requests reaching #RESTBase, but that's not very convenient. Technically this is no more required. How... 
[22:13:10] (03PS1) 10Ayounsi: Fix PTR for cr3-ulsfo<-->cr4-ulsfo link [dns] - 10https://gerrit.wikimedia.org/r/464068 [22:16:29] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:26:56] !log labstore2003 re-started service block_sync [22:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:40] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:42:00] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 51.84 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:49:39] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 70.98 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:53:58] !log powercycling icinga1001 after removing problematic entry from fstab [22:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181002T2300). [23:00:05] Amir1, stephanebisson, and MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:16] o/ [23:00:20] hello [23:00:29] here, but can't really deploy [23:02:18] I guess I can do SWAT today [23:06:30] stephanebisson: any given order for your patches? any note I should keep in mind? Which ones are testable? 
[23:06:44] (03PS1) 10Gergő Tisza: Move auth logging to different channels for easier counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464077 (https://phabricator.wikimedia.org/T150300) [23:07:36] Amir1: The config patch can go anytime. It's not really testable, as you know. Of the other ones 464028 should go first as the other ones depend on it. [23:07:59] Amir1: Also, 464028 is not testable but the next 2 are. [23:09:41] (03PS1) 10Cwhite: icinga: use correct user for tmpfs mount [puppet] - 10https://gerrit.wikimedia.org/r/464078 (https://phabricator.wikimedia.org/T202782) [23:10:06] hmm, wmf.24 is only deployed on group0 (testwikis + mediawiki) and ORES is not enabled on them [23:10:43] Amir1: it is on testwiki, no? [23:11:10] hmm, yeah should be [23:14:01] (03CR) 10Cwhite: [C: 032] "Changes expected: https://puppet-compiler.wmflabs.org/compiler1002/12724/" [puppet] - 10https://gerrit.wikimedia.org/r/464078 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [23:14:09] my patch is working on mwdebug2002, moving forward [23:16:41] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/FlaggedRevs/frontend/specialpages/reports/ProblemChanges_body.php: SWAT: [[gerrit:463943|Fix using the old index when new indexes are not there (T205904)]] (duration: 00m 57s) [23:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:45] T205904: Key 'change_tag_rev_tag' doesn't exist in table 'change_tag' - https://phabricator.wikimedia.org/T205904 [23:26:45] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/ORES/tests/phpunit/includes/HooksTest.php: SWAT: [[gerrit:464028|Disable RCFilters in tests]] (duration: 00m 54s) [23:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:22] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - 
https://phabricator.wikimedia.org/T204088 (10Jdlrobson) @Ottomata still not seeing them.. does that mean https://gerrit.wikimedia.org/r... [23:33:03] 10Operations, 10Wikimedia-Mailing-lists: Transfer Mailman List ownership - https://phabricator.wikimedia.org/T206089 (10eliza) [23:33:24] (03PS1) 10Cwhite: icinga: move configuration directory for resource.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464079 (https://phabricator.wikimedia.org/T202782) [23:34:04] (03CR) 10jerkins-bot: [V: 04-1] icinga: move configuration directory for resource.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464079 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [23:34:28] stephanebisson: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/464021 is live on mwdebug2002 [23:36:08] Amir1: works as expected [23:36:21] (03PS2) 10Cwhite: icinga: move configuration directory for resource.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464079 (https://phabricator.wikimedia.org/T202782) [23:36:47] (03PS3) 10Cwhite: icinga: move configuration directory for resource.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464079 (https://phabricator.wikimedia.org/T202782) [23:37:37] ack [23:37:39] going live [23:37:55] (03CR) 10jerkins-bot: [V: 04-1] icinga: move configuration directory for resource.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464079 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [23:38:39] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/PageTriage/modules/ext.pageTriage.views.list/ext.pageTriage.listControlNav.underscore: SWAT: [[gerrit:464021|Hide copyvio, none afc filter options behind flag (T205918)]] (duration: 00m 56s) [23:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:43] T205918: [betalabs] NPP: 'Potential issues' do not display 'None' and 'Copyvio' filters - https://phabricator.wikimedia.org/T205918 [23:39:45] 10Operations, 10Wikimedia-Mailing-lists: Transfer Mailman 
List ownership - https://phabricator.wikimedia.org/T206089 (10Krenair) If I understand correctly it doesn't work like that. Each list has an administrative password which can be used to administrate the list, including changing the published list of adm... [23:40:13] (03CR) 10Cwhite: [V: 032 C: 032] "Introducing new class define is expected." [puppet] - 10https://gerrit.wikimedia.org/r/464079 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [23:47:51] come on zuul [23:49:07] stephanebisson: "Align copyvio log terminology" is live on mwdebug2002 [23:50:18] Amir1: I don't see it... is there something else that needs to be done since this is a change to a message? [23:51:03] hmm, we l10n caches [23:51:11] but I don't know how to flush them [23:51:14] *we have [23:51:26] you have to run a full scap [23:52:02] 10Operations, 10Community-Tech, 10MediaWiki-Parser, 10Thumbor, and 2 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10Niharika) [23:52:32] Amir1: I think it's safe to sync without more testing. It's really just different words in a message. [23:52:34] found this [23:52:34] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#ResourceLoader_and_l10n_messages [23:52:36] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Isaac) [23:53:03] okay, I will go forward [23:53:28] 10Operations, 10Community-Tech, 10MediaWiki-Parser, 10Thumbor, and 3 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10Niharika) [23:53:32] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Isaac) I believe I have completed my bullet points - just let me know if something is checking out! 
[23:54:04] (03PS1) 10Ayounsi: Add fake ssh keys for netbox user [labs/private] - 10https://gerrit.wikimedia.org/r/464081 (https://phabricator.wikimedia.org/T205898) [23:54:05] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/PageTriage/i18n/en.json: SWAT: [[gerrit:464047|Align copyvio log terminology (T199359)]] (duration: 00m 56s) [23:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:09] T199359: New Pages Feed: copyvio addition - https://phabricator.wikimedia.org/T199359 [23:54:38] MaxSem: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalPreferences/+/464071 this is going first [23:54:53] stephanebisson: I didn't forget your config change, will deploy it after this one [23:54:58] yup [23:55:08] (03CR) 10BryanDavis: "> Though I do wonder, what causes jdk7 to be installed in the first" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/463877 (https://phabricator.wikimedia.org/T205774) (owner: 10BryanDavis) [23:56:37] stephanebisson: regarding the i18n file, there will be a full scap tomorrow in deploying the branch to group1, that would fix it, if not, poke me