[00:06:58] (03PS1) 10GTirloni: toollabs-golang - Update to Stretch and Go 1.10 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) [00:20:15] (03CR) 10Dzahn: "this need to be amended. the actual user name has only been created now and uid is uid: skvjold" [puppet] - 10https://gerrit.wikimedia.org/r/460943 (https://phabricator.wikimedia.org/T204377) (owner: 10Ayounsi) [00:24:43] (03PS1) 10Dzahn: admins: update ldap user name of Margeigh Novotny [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) [00:24:59] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/464740/" [puppet] - 10https://gerrit.wikimedia.org/r/460943 (https://phabricator.wikimedia.org/T204377) (owner: 10Ayounsi) [00:25:21] (03PS2) 10Dzahn: admins: update ldap user name of Margeigh Novotny [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) [00:26:38] (03CR) 10Dzahn: [C: 032] ""ldap_only" not a shell user" [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) (owner: 10Dzahn) [00:27:53] (03CR) 10Dzahn: [C: 032] ""mnovotny" doesn't exist in LDAP but this one does (now)" [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) (owner: 10Dzahn) [00:31:29] !log LDAP: added user skvjold to group wmf (T204377) [00:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:33] T204377: LDAP Acess request for Margeigh Novotny - https://phabricator.wikimedia.org/T204377 [00:36:14] (03CR) 10BryanDavis: "formatting comment inline. 
Before merging and deploying there should be some check to see if we have folks actually using the go image and" (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:39:48] (03CR) 10Legoktm: "legoktm@tools-k8s-master-01:~$ ./pods.sh" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:42:09] (03CR) 10BryanDavis: "Added Dan and Tyler as reviewers since they are apparently the current golang tools maintainers. (Thanks legoktm!)" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:45:07] (03CR) 10GTirloni: toollabs-golang - Update to Stretch and Go 1.10 (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:46:49] (03PS2) 10GTirloni: toollabs-golang - Update to Stretch and Go 1.10 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) [00:52:24] PROBLEM - HHVM rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:23] RECOVERY - HHVM rendering on mw1324 is OK: HTTP OK: HTTP/1.1 200 OK - 80831 bytes in 0.181 second response time [01:43:43] PROBLEM - MD RAID on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:43:53] PROBLEM - mathoid endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:44:04] PROBLEM - SSH on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:04] PROBLEM - apertium apy on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:23] PROBLEM - cpjobqueue endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[01:44:33] PROBLEM - eventstreams on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:33] PROBLEM - changeprop endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:44:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: [01:44:43] /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [01:44:44] PROBLEM - pdfrender on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:53] PROBLEM - Check the NTP synchronisation status of timesyncd on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:04] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:13] PROBLEM - Check systemd state on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:14] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:24] PROBLEM - Disk space on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:45:33] RECOVERY - eventstreams on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 2.393 second response time [01:45:43] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [01:45:43] RECOVERY - pdfrender on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time [01:45:44] RECOVERY - MD RAID on scb2005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:46:03] RECOVERY - mathoid endpoints health on scb2005 is OK: All endpoints are healthy [01:46:14] RECOVERY - apertium apy on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.077 second response time [01:46:14] RECOVERY - SSH on scb2005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [01:46:14] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [01:46:15] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [01:46:24] RECOVERY - Disk space on scb2005 is OK: DISK OK [01:46:33] RECOVERY - cpjobqueue endpoints health on scb2005 is OK: All endpoints are healthy [01:46:43] RECOVERY - changeprop endpoints health on scb2005 is OK: All endpoints are healthy [01:47:14] RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy [02:14:44] RECOVERY - Check the NTP synchronisation status of timesyncd on scb2005 is OK: OK: synced at Fri 2018-10-05 02:14:41 UTC. 
[02:49:23] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [02:55:44] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:27:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 730.68 seconds [03:42:50] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 [03:43:02] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 [03:43:58] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 (owner: 10Andrew Bogott) [03:44:09] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 (owner: 10Andrew Bogott) [03:59:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 294.74 seconds [04:16:29] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/CirrusSearch/includes/DataSender.php: I0769c50c (duration: 01m 01s) [04:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:03] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/libs/filebackend/FileBackendStore.php: T205567 - I75f1eb6dc2cb (duration: 00m 56s) [04:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:08] T205567: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 [04:32:18] (03PS1) 10Varnent: Update for Governance Wiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 [04:38:15] (03PS2) 10Varnent: Update to sitename for Governance Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) [05:03:50] (03PS1) 10Varnent: Additional namespaces for Governance Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464752 (https://phabricator.wikimedia.org/T206173) [05:08:27] (03PS1) 10Marostegui: db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) [05:10:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [05:11:18] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad Marostegui T206254 - The acknowledgement expires at: 2018-10-10 05:10:51. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [05:11:30] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) [05:12:18] (03Merged) 10jenkins-bot: db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [05:13:05] (03PS3) 10Varnent: Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) [05:13:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1092 status - T205514 (duration: 00m 57s) [05:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:46] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [05:25:20] (03CR) 10jenkins-bot: db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [05:27:54] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [05:46:12] (03CR) 10Krinkle: [C: 032] Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) (owner: 10Varnent) [05:47:52] (03Merged) 10jenkins-bot: Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) (owner: 10Varnent) [05:49:26] varnent: If you're here, do you want to verify the change on staging? [05:51:48] It's a two-step process, I can walk you through it if you haven't done it before. 
[05:53:39] <_joe_> !log upgrading python-etcd on conf1004-6, restarting etcdmirror [05:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:17] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 (owner: 10Giuseppe Lavagetto) [05:54:27] (03CR) 10jenkins-bot: Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) (owner: 10Varnent) [05:54:30] * Krinkle verified on mwdebug2002 [05:55:19] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T205599 - Ic28e00c30 (duration: 00m 57s) [05:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:26] T205599: Change $wgSitename for Governance Wiki - https://phabricator.wikimedia.org/T205599 [05:56:11] (03CR) 10Giuseppe Lavagetto: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [06:00:43] (03PS3) 10Giuseppe Lavagetto: wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) [06:01:01] <_joe_> I'll never wrap my head around Mediawiki's use of whitespace around parentheses [06:05:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You also need to change the records in template/wikimedia.org and anywhere else they might appear." 
[dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [06:05:48] (03CR) 10Krinkle: mediawiki::web::prod_sites: convert wiktionary.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:06:46] <_joe_> Krinkle: yeah, I found so many of these small WTFs in our virtual hosts [06:07:18] _joe_: so which one do we generate, or did you add support for the broken variant? [06:07:21] * Krinkle runs compiler to check [06:07:35] <_joe_> we fix it [06:07:48] <_joe_> the generated one will have the correct rewrite [06:07:57] cool, Yeah, I see it now [06:08:05] <_joe_> if we want to support that broken version (should we?) we need to add it manuall [06:08:08] <_joe_> *manually [06:08:36] No, it makes no sense. [06:08:38] <_joe_> that's one of the advantages of letting computers do the formatting, it's harder to make mistakes :P [06:08:51] Even if (big if) we really stored them there in the past, we don't currently, so they'd just be 404 [06:09:13] <_joe_> ack [06:09:28] _joe_: Hm.. retry=0 is being added. What's the default currently? 
[06:09:43] <_joe_> we planned to have retry=0 everywhere [06:10:07] <_joe_> we don't want apache to retry something that timed out on hhvm because of a bad db query for instance [06:10:24] <_joe_> the retries are done by varnish [06:10:49] <_joe_> so no point in retrying locally (and blindly, while varnish IIRC only retries GET requests) [06:11:56] Yeah that makes sense [06:12:07] But does that mean it currently retries where it doesn't say that [06:12:09] <_joe_> so wherever it was missing, it was because or.i and I had too many things to do around HHVM and forgot to standardize on it [06:12:15] or are we just stating the default for clarity [06:12:16] <_joe_> yes [06:12:26] <_joe_> I don't remember what's the default on apache 2.4.x [06:12:30] <_joe_> for mod_proxy [06:13:11] <_joe_> sorry, I got off track completely [06:13:26] <_joe_> what I said is for the proxy balancers [06:13:28] np. [06:13:30] Just one last thing [06:13:35] - RewriteRule ^/w/$ /w/index.php [06:13:44] <_joe_> in our case, "retry" is the time apache waits to repoll the backend [06:13:47] Looks like that one isn't +'ed again. [06:13:51] <_joe_> yes [06:14:02] <_joe_> because we have in the root configuration [06:14:09] <_joe_> DirectoryIndex index.php [06:14:18] Interesting [06:14:22] I guess that works [06:14:30] <_joe_> so if you request /w/, apache will try /w/index.php [06:14:38] what about http://en.wikipedia.org/w/?foo [06:14:42] <_joe_> I checked because half of our vhosts had that rewrite [06:14:48] <_joe_> half didn't [06:14:54] <_joe_> that should work [06:14:59] <_joe_> lemme test [06:15:32] <_joe_> interestingly, doesn't get redirected to /wiki/Main_Page [06:15:39] <_joe_> which is what we'd want, right? [06:15:55] <_joe_> while http://en.wikipedia.org/w/ does [06:15:55] No, it should get handled by MW the same way as /?
[06:16:07] <_joe_> yes, I'm talking about Mediawiki handling it [06:16:09] <_joe_> :) [06:16:22] plain /w/ redirects because MW decides to normalize the url given no query string to dictate otherwise [06:16:26] <_joe_> MW sends out a redirect for /w/ [06:16:28] <_joe_> oh I see [06:16:50] I'm guessing meta-wiki is on the new format because https://meta.wikimedia.org/w/?foo redirects and loses the query [06:17:13] <_joe_> https://en.wikivoyage.org/w/?foo works, and is in the new format [06:17:34] cool [06:17:45] <_joe_> I don't know about meta, but I see the url is localized upon redirect [06:17:48] <_joe_> lemme try via curl [06:18:35] nvm, I don't know why ?foo is lost on Meta-Wiki, but other queries are not [06:18:36] https://meta.wikimedia.org/w/?diff=18447142 [06:18:40] So I guess that's fine. [06:20:37] Interesting, the new rules make ENV:RW_PROTO optional for the /math/ redirect. Most of our new config just assumes it is set though, looks interesting, but fine either way [06:20:53] (03CR) 10Krinkle: [C: 031] "Looks like it fixes that, cool." [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:21:24] <_joe_> we always set it at the top of the file [06:21:43] + RewriteCond %{ENV:RW_PROTO} !="" [06:21:44] + RewriteRule ^/math/(.*) %{ENV:RW_PROTO}://upload.wikimedia.org/math/$1 [R=301] [06:21:44] + RewriteRule ^/math/(.*) https://upload.wikimedia.org/math/$1 [R=301] [06:21:54] <_joe_> oh right the conditional, yes [06:21:56] I assume you didn't introduce it, it just happens to be a variation you used as the basis [06:22:01] <_joe_> also, it's redundant :) [06:22:05] Yeah [06:22:07] <_joe_> it was the most common [06:22:10] Cool [06:22:36] <_joe_> ultimately we should, IMO, move all sites to redirect to https [06:22:48] <_joe_> and thus we can just remove RW_PROTO [06:22:50] Yeah, there's no point in keeping it variable everywhere.
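[Editor's note] The two RewriteRule variants quoted above differ only in whether they consult RW_PROTO, and since RW_PROTO is set at the top of every vhost file the conditional form is redundant, as _joe_ notes. A minimal Python sketch of that rewrite logic (illustrative only, not the actual Apache code; the helper name is made up):

```python
import re

def math_redirect(path, rw_proto=None):
    """Mimic the /math/ rewrite quoted in the log: the conditional
    variant uses RW_PROTO when it is non-empty, the unconditional
    variant hardcodes https. Returns the redirect target, or None
    when the rule does not match."""
    m = re.match(r"^/math/(.*)", path)
    if m is None:
        return None  # RewriteRule pattern did not apply
    proto = rw_proto if rw_proto else "https"
    return f"{proto}://upload.wikimedia.org/math/{m.group(1)}"

# With RW_PROTO always set to https (the end state once all sites
# redirect to https), both variants produce the same target:
assert math_redirect("/math/a/b.png", "https") == math_redirect("/math/a/b.png")
```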
[06:25:44] 10Operations, 10WMF-JobQueue, 10Core Platform Team Kanban (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [06:26:15] 10Operations, 10HHVM, 10User-ArielGlenn: Run all maintenance scripts on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393 (10Krinkle) [06:27:52] 10Operations, 10WMF-JobQueue, 10Core Platform Team Kanban (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) Thanks. I'll purpose it for maintenance hosts (CLI maintenance scripts from cron). For job runners, we us... [06:29:01] 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [06:31:03] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf] [06:31:44] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:31:44] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:33:14] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/confd-lint-wrap] [06:42:45] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) > would you have time to chat on IRC some time today / this week / next week (or the week after Let's... [06:45:55] (03CR) 10Krinkle: [C: 031] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [06:56:23] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:57:13] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:13] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:33] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:09:26] !log installing python3.4/2.7 security updates [07:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:02] (03Abandoned) 10Muehlenhoff: Create component/hhvm324 [puppet] - 10https://gerrit.wikimedia.org/r/439548 (owner: 10Muehlenhoff) [07:10:18] (03PS5) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [07:18:49] !log stopping s5 replication on db1070 [07:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:14] !log stopping s3 replication on db1075 [07:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:29] !log temporarily stop prometheus on bast4001 to finalize data transfer - T179050 [07:20:33] !log 
stopping x1 replication on db1069 [07:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:35] T179050: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 [07:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:41] !log stopping s3 replication on db1070 [07:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:46] (03PS1) 10Muehlenhoff: Add Cumin aliases for new eqiad1 roles [puppet] - 10https://gerrit.wikimedia.org/r/464762 [07:32:34] RECOVERY - Device not healthy -SMART- on bast4001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast4001&var-datasource=ulsfo%2520prometheus%252Fops [07:33:00] !log changing s3 master for db1070 [07:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:14] (03PS1) 10Ema: Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) [07:35:10] (03CR) 10jerkins-bot: [V: 04-1] Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [07:38:11] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10Volans) I checked the current code and I see that in the same try/except that raises that error it does also: ``` from napalm_base.exceptions import ConnectAuthError, ModuleImportError ``` and `n...
[07:38:15] (03CR) 10Elukey: "elukey@conf1005:~$ curl conf1005.eqiad.wmnet:8000/lag -i" [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [07:40:56] (03PS2) 10Volans: ircecho: log exception on exit [puppet] - 10https://gerrit.wikimedia.org/r/463749 (https://phabricator.wikimedia.org/T205522) [07:42:48] (03CR) 10Giuseppe Lavagetto: [C: 031] "Nevermind, brainfart." [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [07:45:26] (03PS4) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [07:46:52] this is me ^^^ deployed a code change, it should rejoin in few seconds [07:47:12] welcome back icinga-wm :) [07:47:39] (03PS5) 10Gehel: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [07:47:53] (03CR) 10Filippo Giunchedi: [C: 032] changing prometheus.svc.ulsfo.wmnet entry to bast4002 [dns] - 10https://gerrit.wikimedia.org/r/464369 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [07:48:47] (03Abandoned) 10Volans: Custom fields: fix field type [software/netbox] - 10https://gerrit.wikimedia.org/r/462860 (https://phabricator.wikimedia.org/T199083) (owner: 10Volans) [07:48:49] (03CR) 10Gehel: [C: 032] icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [07:50:50] !log stopping dbstore1001:x1 [07:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:03] PROBLEM - MariaDB Slave IO: s5 on db1070 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested 
to start from GTID 0-171966669-4075108480, which is not in the masters binlog. Since the masters binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing [07:52:03] transactions [07:54:22] !log starting replication on db1075; db1070, db1070:s3 with disabled gtid [07:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:26] PROBLEM - DPKG on bast3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:56:26] RECOVERY - DPKG on bast3002 is OK: All packages OK [07:58:06] RECOVERY - MariaDB Slave IO: s5 on db1070 is OK: OK slave_io_state Slave_IO_Running: Yes [08:00:40] (03PS2) 10Muehlenhoff: Add Cumin aliases for new eqiad1 roles [puppet] - 10https://gerrit.wikimedia.org/r/464762 [08:01:07] (03PS2) 10Elukey: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:01:22] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10fgiunchedi) Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Prometheus#Sync_data_from_an_existing_Prometheu...
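[Editor's note] The error 1236 above means db1070 asked its master for a GTID position the master never wrote, while the master's binlog already holds higher sequence numbers in that replication domain, so MariaDB concludes the slave has diverged; the operator worked around it by restarting replication with GTID disabled. A rough Python sketch of that divergence check (MariaDB GTIDs are `domain-server_id-sequence`; the function names are illustrative, not MariaDB internals):

```python
def parse_gtid(gtid):
    """Split a MariaDB GTID 'domain-server_id-sequence' into ints."""
    domain, server_id, seq = (int(x) for x in gtid.split("-"))
    return domain, server_id, seq

def diverged(slave_pos, master_binlog):
    """True when the slave's requested GTID is absent from the master's
    binlog even though the binlog already contains higher sequence
    numbers in the same domain -- the condition behind error 1236."""
    domain, _, seq = parse_gtid(slave_pos)
    seqs = [parse_gtid(g)[2] for g in master_binlog
            if parse_gtid(g)[0] == domain]
    return slave_pos not in master_binlog and any(s > seq for s in seqs)

# db1070 requested a sequence number the master never logged,
# sandwiched between ones it did log:
print(diverged("0-171966669-4075108480",
               ["0-171966669-4075108479", "0-171966669-4075108481"]))  # True
```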
[08:01:58] (03CR) 10Muehlenhoff: [C: 032] Add Cumin aliases for new eqiad1 roles [puppet] - 10https://gerrit.wikimedia.org/r/464762 (owner: 10Muehlenhoff) [08:02:29] (03PS6) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [08:02:50] !log start replication on db1069 (x1) [08:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:53] (03CR) 10Elukey: [C: 032] "There seems to be a temporary inconsistency in the lag reported causing the negative numbers, but we decided to go forward with this chang" [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:03:02] (03PS3) 10Elukey: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:03:09] ah! merge sniped :D [08:03:10] (03CR) 10Muehlenhoff: [C: 032] Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [08:03:56] (03PS4) 10Elukey: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:03:59] (03CR) 10Elukey: [V: 032 C: 032] profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:04:38] moritzm: feel free to merge mine :) [08:08:09] elukey: done! [08:08:14] thanks! 
[08:13:28] 10Operations, 10Services, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Add wdqs-updater to scap target in puppet - https://phabricator.wikimedia.org/T206303 (10Mathew.onipe) [08:16:06] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) labsaliaser is working now: ```krenair@bastion-eqiad1-01:~$ dig eqiad1.bastion.wmflabs.org @8.8.8.8 +short 185.15.56.13 krenair@bastion-eqiad... [08:16:27] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Services (doing): Add sudo rules for wdqs-updater in puppet - https://phabricator.wikimedia.org/T206303 (10mobrovac) a:03mobrovac [08:16:39] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Services (doing): Add sudo rules for wdqs-updater in puppet - https://phabricator.wikimedia.org/T206303 (10mobrovac) p:05Triage>03Normal [08:19:49] (03PS1) 10Mobrovac: WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) [08:20:24] (03CR) 10jerkins-bot: [V: 04-1] WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:21:40] (03PS3) 10Jcrespo: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) [08:22:12] (03PS2) 10Mobrovac: WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) [08:23:08] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: Add sudo rules for wdqs-updater in puppet - https://phabricator.wikimedia.org/T206303 (10Gehel) @mobrovac thanks for the fast response! 
I was wondering if we had a cleaner way to declare that a scap::targe... [08:26:11] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10MoritzMuehlenhoff) [08:27:21] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler1002/12779/" [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:28:59] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10MoritzMuehlenhoff) p:05Triage>03Normal [08:30:47] (03CR) 10Mathew.onipe: "Jenkins dry run:" [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:33:14] (03CR) 10Mathew.onipe: [C: 031] WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:35:56] <_joe_> !log reenabling notifications for etcdmirror on conf1005 [08:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:38] (03CR) 10Banyek: [C: 032] wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:37:50] (03CR) 10Banyek: [V: 032 C: 032] wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:39:12] (03CR) 10ArielGlenn: "I'll merge this through then when this week's wd dumps are complete.It's recompressing all-BETA-nt right now, which means there's still th" [puppet] - 10https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: 10Smalyshev) [08:44:34] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 
(https://phabricator.wikimedia.org/T135991) [08:45:24] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:46:31] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) [08:49:51] (03PS2) 10Elukey: Move _etcd._tcp* SRV records to etcd codfw [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) [08:52:18] <_joe_> elukey: I have the cache wipe ready, just for the RW records, which are more important [08:52:34] <_joe_> I'll also wipe the rest afterwards [08:52:47] <_joe_> so just tell me when you've distributed the change [08:52:59] <_joe_> I'll do the cache wipes, then we can verify [08:53:17] <_joe_> and we will start the long process of chasing down long-connected clients [08:53:24] <_joe_> to the old eqiad cluster [08:53:25] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10fgiunchedi) a:05fgiunchedi>03None >>! In T170817#4643270, @kaldari wrote: > @fgiunchedi - Are you still working on this or shou... 
[08:53:42] all right, merging then [08:53:49] (03CR) 10Elukey: [C: 032] Move _etcd._tcp* SRV records to etcd codfw [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [08:54:38] _joe_ authdns updated [08:54:46] <_joe_> elukey: ok, wiping caches now [08:55:06] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Banyek) I'd happily get involved in this [08:55:36] for everybody reading - we are moving etcd srv records from conf100[1-3] to codfw [08:55:46] probably better logging it [08:55:46] <_joe_> elukey: done [08:56:02] <_joe_> !log read-write connections to etcd only go to codfw now [08:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:15] good [08:56:23] so now we chase down the long lived conns [08:56:24] right? [08:56:34] <_joe_> elukey: while I keep cleaning caches, would you stop replication eqiad => codfw? 
[08:56:37] <_joe_> via puppet [08:56:40] ack [08:56:44] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1315 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:56:54] <_joe_> oh right, meh [08:57:05] <_joe_> hehe we might get a ton of those now [08:57:08] <_joe_> sorry [08:57:14] <_joe_> but it's not true [08:57:20] <_joe_> it's just different clusters [08:57:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1224 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:44] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1348 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1331 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:54] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1272 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1315 is OK: etcd last index (302926) matches the master one (302926) [08:57:56] so no outage, right? 
[08:58:02] _joe_: mmmh in theory just few of them [08:58:11] given the index on einsteinium is updated every 30s IIRC [08:58:25] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1247 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:34] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1343 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:34] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1249 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1288 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1324 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1289 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1281 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:40] <_joe_> !log wiped cached values for the read-only etcd SRV record [08:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1245 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1228 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:48] <_joe_> volans: well a few is still a lot [08:58:51] <_joe_> and yes, no outage [08:58:54] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1332 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:54] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1269 is CRITICAL: etcd last index (212133) is outdated compared to the master 
one (302926) [08:58:55] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1278 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:55] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1272 is OK: etcd last index (302926) matches the master one (302926) [08:58:55] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1287 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:57] they seem too many to me [08:58:58] <_joe_> sorry for the noise [08:59:00] checking the check [08:59:04] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1339 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:59:19] <_joe_> volans: all mediawikis still reading from eqiad will have that error [08:59:27] <_joe_> einsteinium is updated [08:59:33] _joe_: the check does dig +short SRV "_etcd._tcp.${dc}.wmnet" [08:59:35] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1247 is OK: etcd last index (302926) matches the master one (302926) [08:59:44] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1343 is OK: etcd last index (302926) matches the master one (302926) [08:59:44] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1249 is OK: etcd last index (302926) matches the master one (302926) [08:59:44] on einsteinium [08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1288 is OK: etcd last index (302926) matches the master one (302926) [08:59:45] <_joe_> see the recoveries? 
[08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1324 is OK: etcd last index (302926) matches the master one (302926) [08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1289 is OK: etcd last index (302926) matches the master one (302926) [08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1281 is OK: etcd last index (302926) matches the master one (302926) [08:59:51] yeah ok, [08:59:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1224 is OK: etcd last index (302926) matches the master one (302926) [08:59:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1245 is OK: etcd last index (302926) matches the master one (302926) [08:59:55] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1228 is OK: etcd last index (302926) matches the master one (302926) [08:59:55] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1348 is OK: etcd last index (302926) matches the master one (302926) [09:00:03] <_joe_> it's expected, it was just a few of them [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1331 is OK: etcd last index (302926) matches the master one (302926) [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1332 is OK: etcd last index (302926) matches the master one (302926) [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1269 is OK: etcd last index (302926) matches the master one (302926) [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1278 is OK: etcd last index (302926) matches the master one (302926) [09:00:05] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1287 is OK: etcd last index (302926) matches the master one (302926) [09:00:14] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1339 is OK: etcd last index (302926) matches the master one (302926) [09:00:15] <_joe_> actually this proves the check works well :P [09:00:30] (03PS1) 10Elukey: Stop replicating etcd data from conf100[1-3] to codfw [puppet] - 10https://gerrit.wikimedia.org/r/464772 (https://phabricator.wikimedia.org/T205814) [09:00:36] 
there you go --^ [09:00:38] ahahah [09:00:48] <_joe_> elukey: downtime conf2002, merge and apply [09:00:52] ack [09:01:02] (03CR) 10Giuseppe Lavagetto: [C: 032] Stop replicating etcd data from conf100[1-3] to codfw [puppet] - 10https://gerrit.wikimedia.org/r/464772 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [09:02:18] downtimed [09:02:31] <_joe_> ok [09:02:33] merging and applying [09:02:46] <_joe_> cool tnx [09:04:24] <_joe_> elukey: I'll now restart confd on a server in esams to check the connections go away from conf1001 [09:05:17] <_joe_> cool works as expected [09:05:26] (03CR) 10Jcrespo: [C: 032] mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:05:36] Notice: /Stage[main]/Profile::Etcd::Replication/Etcdmirror::Instance[/conftool@eqiad.wmnet]/Systemd::Service[etcdmirror-conftool-eqiad-wmnet]/Service[etcdmirror-conftool-eqiad-wmnet]/ensure: ensure changed 'running' to 'stopped' [09:05:49] <_joe_> !log restarting confd on all nodes in eqiad and esams [09:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:08] !log stop etcdmirror replication on conf2002 [09:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:55] (03PS1) 10Volans: Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 [09:07:09] (03Merged) 10jenkins-bot: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:08:43] (03CR) 10Alexandros Kosiaris: [C: 031] Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:09:06] (03CR) 10Volans: [C: 032] Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:10:23] (03CR) 
10jerkins-bot: [V: 04-1] Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:10:48] bad jenkins... what's wrong [09:11:18] damn again the randomly failing test from a static checker, I need to dig into this [09:11:21] <_joe_> volans: why can't cumin connect to tegmen? [09:11:27] (03CR) 10jenkins-bot: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:11:33] <_joe_> elukey: now can you go to like conf1001 and check what's still connecting there? [09:11:34] _joe_: checking [09:11:52] _joe_: I think because we had a re-occurrence of [09:12:00] T199413 [09:12:01] T199413: Systemd restart loop of timer filled the disk on tegmen - https://phabricator.wikimedia.org/T199413 [09:12:07] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Move some wikis for s3 to s5 (duration: 00m 56s) [09:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:18] _joe_ already doing it [09:12:41] (03PS1) 10Banyek: Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 [09:12:42] _joe_: I'm checking the console [09:12:52] godog: FYI ^^^ (tegmen) [09:13:57] <_joe_> elukey: it should only be pybals [09:14:15] <_joe_> but, since it's also esams (we should've moved it before the switch) [09:14:19] <_joe_> let's start there [09:14:28] <_joe_> I'm merging the puppet changes in a moment [09:14:45] _joe_ yep it seems so, I grepped in netstat and I can see other stuff but not related to etcd [09:15:35] also conf100[2,3] looks good afaics [09:15:36] tegmen pretty unresponsive on console, I'll reboot it [09:16:22] !log rebooting tegmen, console stuck, possible re-occurrence of T199413 (to be confirmed) [09:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:42] volans: uuughhhh, looks like fs was at 30% 
tho [09:16:55] I was looking at https://grafana.wikimedia.org/dashboard/db/host-overview-grafanalib?refresh=300s&orgId=1&panelId=12&fullscreen&var-datasource=codfw%20prometheus%2Fops&var-server=tegmen&var-cluster=misc [09:17:02] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) [09:17:31] godog: maybe was stuck for other reasons, let's see now that comes back ;) [09:17:47] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:19:09] godog: confirmed, disk space is ok, seems unreleated [09:19:13] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) [09:19:40] 10Operations, 10Core Platform Team Kanban (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10mobrovac) [09:19:54] godog: but funnily enough, it's now in the restart loop [09:20:03] (03PS1) 10Giuseppe Lavagetto: pybal: stop reading from the old etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/464779 (https://phabricator.wikimedia.org/T205814) [09:20:15] <_joe_> elukey: ^^ [09:20:21] <_joe_> let's do this quick [09:22:06] <_joe_> ok, merging it [09:22:09] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: stop reading from the old etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/464779 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [09:23:14] ack [09:23:15] <_joe_> now I'll apply puppet on the lvs servers in eqiad and esams [09:23:31] does it require a rolling restart of pybal or just a puppet run? 
[09:23:45] <_joe_> both ofc [09:23:48] ok [09:24:07] <_joe_> we're not foolish enough to let puppet control pybal that much [09:24:18] yep [09:24:30] (03CR) 10Muehlenhoff: [C: 031] "Looks good. Given that our package build host is now also running stretch, you can also simply add it (and the golang one) to the package " [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [09:24:38] s/foolish/crazy/g :P [09:25:23] <_joe_> elukey: can you do the dance in eqiad? [09:25:29] <_joe_> remember, first the secondary pybals [09:25:32] <_joe_> then the primaries [09:26:10] me restarting pybals? I'd prefer not :) [09:26:22] <_joe_> why? [09:26:31] :_( he doesn't trust pybal [09:26:33] * vgutierrez sad [09:26:37] never done it and I don't want to make things exploding [09:26:39] <_joe_> somehow my fingers give magical juice to systemctl? [09:26:40] <_joe_> :P [09:26:44] <_joe_> it's eqiad [09:26:53] <_joe_> you can play in the sand, kid [09:26:55] <_joe_> :P [09:27:04] there is no problem in restart pybals [09:27:13] as long as you don't restart both of a group at the same time [09:27:21] even in that case [09:27:35] well, there is gonna be a spike in 5xx then [09:27:37] <_joe_> that's why I said "first secondaries, then primaries" [09:27:38] as I 've found out [09:27:49] the default static routers would trigger and route the traffic via the primary lvs instances [09:27:49] <_joe_> so the problem right now is [09:27:53] s/routers/routes/g [09:27:57] <_joe_> puppet changes an npre check [09:28:05] <_joe_> that will start alarming if we don't restart pybal [09:28:10] vgutierrez: actually it would cause a BGP flap and traffic going back and forth [09:28:12] <_joe_> which is a nice reminder :) [09:28:28] it would coalesce pretty soon but still [09:29:01] <_joe_> elukey: ok I wil restart pybal on the lvs servers [09:29:53] <_joe_> can someone check lvs1002? 
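[Editorial sketch] The restart order _joe_ insists on (secondary pybals first, then primaries) avoids taking both LVS members of a group down at once, which would otherwise cause the BGP flap and 5xx spike vgutierrez describes below. A dry-run sketch of that ordering; the host groupings here are hypothetical placeholders, not the real eqiad topology:

```shell
# Dry run of the rolling pybal restart order discussed above:
# secondary LVS hosts first, then primaries, never both members
# of a group at the same time. Host names are illustrative only.
secondaries="lvs1006 lvs1016"   # hypothetical secondary pybals
primaries="lvs1001 lvs1003"     # hypothetical primary pybals

for h in $secondaries $primaries; do
  # Print instead of executing: this is a dry run, not the real procedure.
  echo "would run: ssh $h sudo systemctl restart pybal"
done
```

Puppet deliberately does not restart pybal itself ("we're not foolish enough to let puppet control pybal that much"), so both the puppet run and the manual restart are needed.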
[09:30:04] thanks, it takes me a bit of time to find 1) the lvs 2) understand who is primary/secondary 3) restart [09:30:45] <_joe_> vgutierrez: I see an alert about bgp sessions on lvs3004 [09:31:01] hmmm [09:31:04] * vgutierrez checking [09:31:33] <_joe_> just webnt away [09:31:34] <_joe_> sorry [09:31:50] yup.. seems happy :) [09:32:00] don't worry [09:32:04] <_joe_> ok so lvs1002 seems to be farting since 15 hours [09:32:13] <_joe_> how ocme I didn't notice it? [09:32:42] <_joe_> Puppet is disabled. bblack [09:32:51] <_joe_> nice, informative comment brandon :P [09:33:35] from the logs it might be due to the network maintenance that we did yesterday [09:33:49] <_joe_> pybal is dead too there, so there isn't much to worry about [09:36:14] !log adding wmf-pt-kill_2.2.20-1+wmf2 package for stretch [09:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:38] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:36:38] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Systemd restart loop of timer filled the disk on tegmen - https://phabricator.wikimedia.org/T199413 (10Volans) Sorry, false alarm, it was unrelated. 
[09:37:11] !log disabling puppet on labsdb1009,labsdb1010,labsdb1011 (T203674) [09:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:15] T203674: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 [09:37:17] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:37:30] this is me --^ [09:37:48] !log restart rsyslog on lithium - broken connection to tegmen - T199406 [09:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:52] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 [09:38:07] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1114 days) [09:38:28] <_joe_> elukey: can you confirm there are no more external connections to 1003? [09:38:38] checking [09:38:45] godog: Cc for rsyslog --^ [09:39:11] (03CR) 10Volans: [C: 032] "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:39:30] _joe_ I can see established to lvs3002.esams.wmn [09:39:33] thanks elukey! [09:40:00] <_joe_> elukey: uhm [09:40:08] tcp 0 0 conf1003.eqiad.wmn:2379 lvs3002.esams.wmn:58102 ESTABLISHED - [09:40:23] <_joe_> can you look which applicaiton is doing that? 
[09:40:27] <_joe_> that's pretty strange [09:40:37] (03Merged) 10jenkins-bot: Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:41:03] nginx 31412 www-data 20u IPv4 1040953863 0t0 TCP conf1003.eqiad.wmnet:2379->lvs3002.esams.wmnet:58096 (ESTABLISHED) [09:41:12] there you go :) [09:41:20] <_joe_> no I mean on lvs3002 :P [09:41:42] <_joe_> I knew it was nginx on the conf1003 side [09:42:30] pybal [09:42:47] PROBLEM - Device not healthy -SMART- on bast4001 is CRITICAL: cluster=misc device=sdc instance=bast4001:9100 job=node site=ulsfo https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast4001&var-datasource=ulsfo%2520prometheus%252Fops [09:43:00] <_joe_> meh [09:43:48] PROBLEM - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=4) [09:44:37] there are two crits for 1006 and 1016 as well [09:44:43] <_joe_> yes, expected [09:44:57] ack [09:45:08] <_joe_> 1006 should recover soon [09:45:27] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:46:38] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 42 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:47:51] <_joe_> labpuppetmaster1001.wikimedia.org is still connected to conf1001 [09:48:57] RECOVERY - PyBal connections to etcd on lvs1001 is OK: OK: 4 connections established with conf2001.codfw.wmnet:2379 (min=4) [09:49:56] I can see webperf1001.eqiad also on conf1003 [09:50:28] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 42 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:50:33] /srv/deployment/performance/navtiming/run_navtiming.py [09:52:04] <_joe_> elukey: wat [09:52:28] <_joe_> what is that reading? 
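[Editorial sketch] Chasing down the remaining long-lived clients, as elukey does here, is mostly a matter of reading `netstat`/`lsof` output on the old conf hosts for established connections on the etcd client port (2379). A small sketch that extracts the remote peer from the netstat line quoted verbatim in the log (the `awk` invocation is an assumption, not the command actually used):

```shell
# Parse a captured netstat line (quoted verbatim in the log, hostname
# truncation included) and print the remote endpoint still connected
# to conf1003's etcd client port (2379).
line='tcp 0 0 conf1003.eqiad.wmn:2379 lvs3002.esams.wmn:58102 ESTABLISHED -'

# Field 5 is the foreign address host:port; keep only the host part.
peer=$(printf '%s\n' "$line" |
  awk '$6 == "ESTABLISHED" { split($5, p, ":"); print p[1] }')

echo "$peer"   # the client that still needs a confd/pybal restart
```

On the conf host itself, `lsof -i :2379` gives the same picture from the server side, as the nginx line a few messages later shows.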
[09:52:32] <_joe_> it's not documented [09:52:59] I am trying to understand [09:53:35] ah yes it is in /srv/deployment/performance/navtiming/config.ini [09:53:47] /conftool/v1/mediawiki-config/common/WMFMasterDatacenter [09:53:58] <_joe_> ok [09:54:10] <_joe_> how can we change the configuration? [09:54:37] lemme check the navtiming puppet config [09:54:46] or possibly the repo [09:54:51] <_joe_> also, if it used python-etcd, a simple restart would suffice [09:56:05] <_joe_> I'll submit a patch in case [09:57:45] yes python-etcd is deployed [10:00:10] (03PS2) 10Jcrespo: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) [10:03:17] <_joe_> ok navtiming uses python-etcd AFAICT [10:03:50] <_joe_> !log restarting navtiming.service on webperf1001 to pick up the dns change for etcd [10:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:29] (03PS1) 10Banyek: wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) [10:04:51] <_joe_> elukey: it's now connected to 2002 [10:05:02] _joe_ yep I was about to say that, I just finished reading the code [10:05:05] super [10:05:15] <_joe_> elukey: it uses python-etcd or conftool? 
[10:05:20] <_joe_> the latter would be better [10:05:57] https://github.com/wikimedia/performance-navtiming/blob/master/navtiming/__init__.py#L46 [10:06:09] <_joe_> heh [10:06:16] <_joe_> ok, I will send a patch then [10:07:22] <_joe_> elukey: now if you want to be double sure, check the grafana data for etcd in codfw and some of the logs [10:07:35] so grepping for 2379 I can only see labspuppetmaster still connected to conf1001 [10:07:53] <_joe_> yeah, we need to understand that too [10:08:05] <_joe_> I thought it was confd, but that should be restarted already [10:09:23] it is confd indeed [10:09:33] confd 974 root 3u IPv4 446566539 0t0 TCP labpuppetmaster1001.wikimedia.org:54070->conf1001.eqiad.wmnet:2379 (ESTABLISHED) [10:09:41] <_joe_> ok, just restart it :) [10:10:05] (03PS1) 10Volans: Add missing build dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/464783 [10:10:11] connected to conf2001 now :) [10:10:15] <_joe_> cool [10:10:25] (03Abandoned) 10Volans: Add missing build dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/464783 (owner: 10Volans) [10:10:35] !log restart confd on labs-puppetmaster to pick up new etcd settings (eqiad -> codfw) [10:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:52] <_joe_> elukey: let's take a break, then we can play with some hosts in eqiad and connect them to the new cluster [10:11:23] <_joe_> well done! 
[10:11:26] (03PS1) 10Volans: Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 [10:11:30] well you did all the work :D [10:11:41] glad that we are close to nuke conf100[1,3] :) [10:13:27] (03CR) 10jerkins-bot: [V: 04-1] Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:13:49] (03CR) 10Volans: "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:13:58] RECOVERY - MegaRAID on helium is OK: OK: optimal, 1 logical, 12 physical [10:15:53] (03CR) 10jerkins-bot: [V: 04-1] Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:16:24] (03CR) 10Volans: [C: 032] "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:17:14] !log rearmed keyholder on netmon2001 [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:28] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. [10:17:53] (03Merged) 10jenkins-bot: Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:21:25] 10Operations, 10ops-ulsfo, 10Traffic: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592 (10MoritzMuehlenhoff) Note that this host also emits SMART errors since two days, not worth investigating further as it's going to be decommed. 
[10:21:42] ACKNOWLEDGEMENT - Device not healthy -SMART- on bast4001 is CRITICAL: cluster=misc device=sdc instance=bast4001:9100 job=node site=ulsfo Muehlenhoff T178592 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast4001&var-datasource=ulsfo%2520prometheus%252Fops [10:26:48] 10Operations, 10Traffic: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) p:05Triage>03Normal [10:34:32] 10Operations, 10Traffic, 10vm-requests: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Krenair) [10:35:43] (03PS2) 10Banyek: wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) [10:36:33] 10Operations, 10Traffic, 10vm-requests: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) a:03Vgutierrez [10:37:22] (03PS3) 10Banyek: wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) [10:37:54] !log uploaded spicerack_0.0.9-1{,+deb9u1} to apt.wikimedia.org {jessie,stretch}-wikimedia - T199079 [10:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:59] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [10:37:59] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) >>! In T203786#4629141, @elukey wrote:= > The throttling mechanism that memca... 
[10:39:20] (03CR) 10Banyek: [C: 032] wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [10:40:19] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) [10:40:59] !log restarting replication on labsdb1010/1 on s3 and s5 [10:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:36] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:46:08] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:50:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:51:01] * banyek|away lunch [10:56:05] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [11:02:32] (03PS1) 10Vgutierrez: dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) [11:09:32] (03PS2) 10Vgutierrez: dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) [11:12:08] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:14:18] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: 
OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:17:20] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) So the wikis have been loaded into s5, and they are the primary place to read them (and eventually, write them), the only think pending is, some... [11:17:40] (03PS1) 10Jcrespo: site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - 10https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) [11:22:21] (03CR) 10Alex Monk: [C: 031] dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) (owner: 10Vgutierrez) [11:24:27] (03CR) 10Vgutierrez: [C: 032] dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) (owner: 10Vgutierrez) [11:25:29] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) This has to be done https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replica_DNS **after **the dblists are updated (without an... 
[11:29:14] !log rebooting ruthenium for kernel security update [11:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:48] PROBLEM - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [11:30:51] ACKNOWLEDGEMENT - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206313 [11:30:56] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10ops-monitoring-bot) [11:31:14] (03PS1) 10Elukey: Add matomo1001's IPv6 PTR [dns] - 10https://gerrit.wikimedia.org/r/464799 (https://phabricator.wikimedia.org/T202962) [11:31:49] (03CR) 10Elukey: [C: 032] Add matomo1001's IPv6 PTR [dns] - 10https://gerrit.wikimedia.org/r/464799 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [11:33:02] 10Operations, 10Scap, 10Services, 10Discovery-Search (Current work), 10Release: Modify scap::target to define sudo rules for multiple services - https://phabricator.wikimedia.org/T206314 (10Mathew.onipe) [11:33:09] 10Operations, 10Scap, 10Services, 10Discovery-Search (Current work), 10Release: Modify scap::target to define sudo rules for multiple services - https://phabricator.wikimedia.org/T206314 (10Mathew.onipe) a:03Mathew.onipe [11:33:40] (03PS6) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [11:34:03] (03CR) 10jerkins-bot: [V: 04-1] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [11:37:27] 10Operations, 10Analytics, 10Analytics-Kanban: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) p:05Triage>03Normal [11:42:10] (03CR) 10Arturo Borrero Gonzalez: [C: 031] Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:42:48] (03CR) 10Arturo Borrero 
Gonzalez: [C: 031] Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:43:10] !log rebooting wezen for kernel security update [11:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:27] (03PS1) 10Elukey: Decommission bohrium [puppet] - 10https://gerrit.wikimedia.org/r/464802 (https://phabricator.wikimedia.org/T206315) [11:44:45] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10aborrero) Could you please also patch `profile::prometheus::openstack_exporter`? [11:46:16] (03CR) 10Elukey: [C: 032] Decommission bohrium [puppet] - 10https://gerrit.wikimedia.org/r/464802 (https://phabricator.wikimedia.org/T206315) (owner: 10Elukey) [11:48:53] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff) >>! In T135991#4645190, @aborrero wrote: > Could you please also patch `profile::prometheus::openstack_exporter`? I actually looked into that earl... [11:49:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10ops-monitoring-bot) wmf-decommission-host was executed by elukey for bohrium.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from Pu... [11:50:40] volans: just used wmf-decommission-host - is it working with ganeti instances? 
[11:50:50] because at some point it asks for the mgmt interface [11:53:50] 10Operations, 10Wikidata, 10Wikidata-Query-Service: wdqs1009 - cannot create /var/log/wdqs/wdqs_autodeployment.log - https://phabricator.wikimedia.org/T206318 (10Krenair) [11:53:53] the output in the task is awesome btw [11:57:18] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10aborrero) Do you mean writing to a file somewhere all the env vars produced by `/root/novaenv.sh` and then loading them in `EnvironmentFile=`? I... [11:59:42] !log deleted bohrium from ganeti via gnt-instance [11:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:39] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) ``` elukey@ganeti1001:~$ sudo gnt-instance remove bohrium.eqiad.wmnet This will remove the volumes of the instance bohrium.eqiad.wmnet (including mirrors), thus rem... [12:00:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) [12:02:08] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) 05Open>03Resolved [12:05:00] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) We have a mechanism called `dmz_cidr` which we can use to exclude NATs between certain IP ranges. See a more detailed explanation h... 
[12:06:41] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10jcrespo) p:05Triage>03Normal a:03Cmjohnson m3 master, hopefully you will have still a couple of 600GB disks for this and for T206254 [12:09:56] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) Not surprising after a reboot :( [12:12:03] !log Creating certcentral2001.codfw.wmnet in ganeti - T206308 [12:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:07] T206308: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 [12:13:41] !log Creating certcentral1001.eqiad.wmnet in ganeti - T206308 [12:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:16] PROBLEM - cpjobqueue endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:16:26] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retr [12:16:26] data for April 29, 2016) timed out before a response was received: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [12:16:27] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:16:36] PROBLEM - eventstreams on scb2001 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [12:16:48] PROBLEM - Check size of conntrack table on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:16:56] PROBLEM - SSH on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:57] PROBLEM - Disk space on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:06] PROBLEM - apertium apy on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:06] PROBLEM - pdfrender on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:17] PROBLEM - Check whether ferm is active by checking the default input chain on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:26] PROBLEM - nutcracker process on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:27] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [12:17:47] PROBLEM - dhclient process on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:56] PROBLEM - configured eth on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:18:26] can't connect via ssh --^ [12:18:26] RECOVERY - nutcracker process on scb2001 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [12:18:36] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 0.076 second response time [12:18:42] ah yes now I can, oom killer party [12:18:46] RECOVERY - dhclient process on scb2001 is OK: PROCS OK: 0 processes with command name dhclient [12:18:47] RECOVERY - Check size of conntrack table on scb2001 is OK: OK: nf_conntrack is 6 % full [12:18:47] RECOVERY - SSH on scb2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [12:18:47] RECOVERY - configured eth on scb2001 is OK: OK - interfaces up [12:18:50] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10MoritzMuehlenhoff) >>! In T206304#4645232, @aborrero wrote: > Do you mean writing to a file somewhere all the env vars produced by `/root/novaen... [12:18:57] RECOVERY - Disk space on scb2001 is OK: DISK OK [12:18:57] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.073 second response time [12:18:57] RECOVERY - apertium apy on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time [12:19:14] mobrovac: --^ [12:19:15] pdfrender? [12:19:16] RECOVERY - cpjobqueue endpoints health on scb2001 is OK: All endpoints are healthy [12:19:17] RECOVERY - Check whether ferm is active by checking the default input chain on scb2001 is OK: OK ferm input default policy is set [12:19:17] what's the load avg elukey? I can see it skyrocketed up [12:19:27] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [12:19:40] sigh [12:19:48] elukey@scb2001:~$ uptime 12:19:42 up 220 days, 22:01, 1 user, load average: 1780.46, 3441.59, 1778.26 [12:19:51] disk space? what happened? 
[12:20:00] ah no, just all critical [12:20:06] (03PS1) 10Muehlenhoff: Further WMCS Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/464805 [12:20:09] mobrovac: a nodejs app was killed by the oom [12:20:53] elukey: which one? [12:21:43] well I can only see [12:21:43] [Fri Oct 5 12:36:43 2018] Out of memory: Kill process 7903 (nodejs) score 24 or sacrifice child [12:21:46] [Fri Oct 5 12:36:43 2018] Killed process 7903 (nodejs) total-vm:10924168kB, anon-rss:810452kB, file-rss:0kB, shmem-rss:0kB [12:22:01] (03PS3) 10Gehel: WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [12:22:40] (03PS1) 10Vgutierrez: install_server: Add certcentral[12]001 to the DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/464806 (https://phabricator.wikimedia.org/T206308) [12:23:00] (03CR) 10Gehel: [C: 032] WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [12:23:17] hm [12:24:51] 10Operations, 10Traffic, 10vm-requests, 10Patch-For-Review: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) certcentral1001 created with the following cmd: ``` sudo gnt-instance add -t drbd -I hail --net 0:link=private --hypervisor-parameters=kvm:boot_order=netwo... 
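The oom-killer diagnosis above boils down to spotting the kernel's "Killed process" line in dmesg and reading off the victim. A minimal Python sketch of that parsing step, assuming the stock Linux oom-killer message format (the exact line quoted from scb2001 above is used as the sample):

```python
import re

# The kernel oom-killer line quoted from scb2001's dmesg above.
LINE = ("[Fri Oct  5 12:36:43 2018] Killed process 7903 (nodejs) "
        "total-vm:10924168kB, anon-rss:810452kB, file-rss:0kB, shmem-rss:0kB")

def parse_oom_kill(line):
    """Extract pid, command name and anonymous RSS (kB) from a
    'Killed process' dmesg line; return None if the line doesn't match."""
    m = re.search(r"Killed process (\d+) \((\S+)\).*?anon-rss:(\d+)kB", line)
    if not m:
        return None
    pid, name, rss = m.groups()
    return {"pid": int(pid), "name": name, "anon_rss_kb": int(rss)}

print(parse_oom_kill(LINE))
# → {'pid': 7903, 'name': 'nodejs', 'anon_rss_kb': 810452}
```

Note the caveat raised below about the bracketed timestamp: `dmesg -T` reconstructs wall-clock time from the boot time plus the monotonic counter, so it can drift after suspend or clock adjustments and is best treated as approximate.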
[12:25:22] https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&orgId=1&from=now-1h&to=now&var-stream=All&var-topic=All&var-scb_host=scb2001 looks suspicious [12:25:42] (03PS2) 10Muehlenhoff: Further WMCS Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/464805 [12:25:57] elukey: the time stamp of the log is weird, it's in the future [12:26:47] ah yes it is dmesg -T, might not be accurate IIRC [12:27:10] some ES worker died recently mobrovac [12:27:58] (03PS1) 10Mathew.onipe: scap::target: added services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) [12:29:59] hm i can't see anything suspicious in the logs [12:30:02] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:30:41] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:30:57] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:31:07] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) [12:33:49] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10fgiunchedi) >>! In T179050#4644666, @fgiunchedi wrote: > Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Pr... [12:35:45] (03CR) 10Alexandros Kosiaris: "This looks substantially more code for a function that arguably envsubst seems to handle fine. Which platforms miss that tool ? 
On Debian " [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 (owner: 10Dduvall) [12:36:54] (03CR) 10Muehlenhoff: [C: 032] Further WMCS Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/464805 (owner: 10Muehlenhoff) [12:54:48] (03CR) 10Alexandros Kosiaris: [C: 031] Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [12:58:12] elukey: true, it's not explicitly supported, but should work, you can put a wrong host as mgmt and it will catch the error and just print a message but should continue [12:58:19] (03CR) 10Gehel: [C: 04-1] "See comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [12:58:22] if you could though please open a task for it [12:58:34] (03CR) 10Alexandros Kosiaris: [C: 031] Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [12:58:41] (03PS1) 10Elukey: Remove explicit hiera calls from hive/oozie mysql classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 [13:02:22] (03PS1) 10Banyek: wmf-pt-kill: bugfix for config file [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 [13:02:30] !log upgraded spicerack to version 0.0.9 on sarin/neodymium/cumin* - T199079 [13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:35] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [13:03:59] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10Volans) [13:07:20] (03CR) 10Banyek: [C: 032] Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [13:09:40] (03CR) 10Muehlenhoff: "For context; do you want to ship a default in the deb or not? 
It's also fine to only ship one in .examples and fully rely on puppet to cre" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 (owner: 10Banyek) [13:12:28] (03CR) 10Banyek: [C: 032] "> For context; do you want to ship a default in the deb or not? It's" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 (owner: 10Banyek) [13:12:36] (03CR) 10Banyek: [V: 032 C: 032] wmf-pt-kill: bugfix for config file [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 (owner: 10Banyek) [13:13:03] (03PS2) 10Elukey: Remove explicit hiera calls from hive/oozie mysql classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 [13:15:13] (03PS2) 10Banyek: Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 [13:15:16] (03CR) 10Banyek: [V: 032 C: 032] Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [13:17:14] (03PS1) 10Elukey: Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 [13:24:38] (03CR) 10Ema: [C: 031] Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:25:52] !log installing python3.5/2.7 security updates [13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:03] !log adding wmf-pt-kill_2.2.20-1+wmf3 package for stretch [13:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:56] (03CR) 10Ema: [C: 031] "Two minor comments, LGTM otherwise. Thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:30:24] (03CR) 10Ottomata: [C: 031] "This is good. 
I think the motivation for this was to be able to run the mysql server separately from the oozie/hive servers, but we can d" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 (owner: 10Elukey) [13:31:13] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [13:31:31] (03CR) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:32:31] (03PS1) 10Volans: sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 [13:35:07] (03PS1) 10Herron: Revert "lists: deny subscriptions from blocklisted IP addresses" [puppet] - 10https://gerrit.wikimedia.org/r/464819 [13:35:22] (03PS2) 10Herron: Revert "lists: deny subscriptions from blocklisted IP addresses" [puppet] - 10https://gerrit.wikimedia.org/r/464819 [13:36:51] (03CR) 10Herron: [C: 032] Revert "lists: deny subscriptions from blocklisted IP addresses" [puppet] - 10https://gerrit.wikimedia.org/r/464819 (owner: 10Herron) [13:42:05] (03CR) 10Jcrespo: [C: 031] sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 (owner: 10Volans) [13:45:06] (03PS1) 10Banyek: wikireplicas: enable wmf-pt-kill service [puppet] - 10https://gerrit.wikimedia.org/r/464821 [13:45:58] (03CR) 10Marostegui: "will you make sure to test it on a host with puppet disabled on the others?" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (owner: 10Banyek) [13:46:28] I see that! 
[13:46:32] (03CR) 10Marostegui: "Add the task number to the patch" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (owner: 10Banyek) [13:47:27] (03PS2) 10Banyek: wikireplicas: enable wmf-pt-kill service [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) [13:48:52] (03CR) 10Banyek: "- disabling puppet on hosts" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [13:50:20] (03CR) 10Marostegui: [C: 031] "> - disabling puppet on hosts" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [13:51:51] (03CR) 10Herron: [C: 032] "reverted this as subscription spam has subsided and we're seeing some false positives" [puppet] - 10https://gerrit.wikimedia.org/r/464819 (owner: 10Herron) [13:55:18] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 (owner: 10Volans) [13:56:25] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 (owner: 10Volans) [14:01:55] (03PS1) 10Muehlenhoff: Remove Diamond from openldap/labs hosts [puppet] - 10https://gerrit.wikimedia.org/r/464822 (https://phabricator.wikimedia.org/T183454) [14:07:48] (03CR) 10Ema: [C: 04-1] "Comments inline. Also, we might want to limit this to eqiad/codfw, where varnish-be communicate directly with the applayer. 
I'm not sure e" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [14:12:57] 10Operations, 10monitoring: Setup metrics monitoring for OpenLDAP/corp - https://phabricator.wikimedia.org/T206327 (10MoritzMuehlenhoff) [14:16:15] (03PS5) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [14:16:26] (03CR) 10Gilles: Backend-Timing Varnish mtail program (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [14:16:54] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Gilles) 05stalled>03Open [14:17:08] (03CR) 10Jcrespo: "This is scheduled for some hours before the switch dc, so no maintenance script using these can be misslead. Labsdb dns, which uses these," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [14:21:18] (03PS6) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [14:23:37] (03CR) 10Marostegui: [C: 031] mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [14:27:29] (03PS4) 10Mark Bergsma: Modernize and cleanup Coordinator [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 [14:27:31] (03PS5) 10Mark Bergsma: Extend testConfigServerRemoval test case. 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/447770 [14:27:33] (03PS3) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [14:28:11] (03CR) 10Vgutierrez: Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:28:50] vgutierrez: ^ :) [14:29:13] <3 [14:30:09] (03CR) 10Elukey: [C: 032] Remove explicit hiera calls from hive/oozie mysql classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 (owner: 10Elukey) [14:33:47] (03PS1) 10Elukey: Add new passwords for oozie/hive to role::analytics_cluster::coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/464825 [14:33:55] (03PS4) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [14:34:45] (03CR) 10Elukey: [V: 032 C: 032] Add new passwords for oozie/hive to role::analytics_cluster::coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/464825 (owner: 10Elukey) [14:34:48] (03PS2) 10Elukey: Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 [14:35:47] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10herron) >>! In T206261#4645246, @aborrero wrote: > We have a mechanisms called `dmz_cidr` which we can use to exclude NATs between certain IP... 
[14:36:45] (03CR) 10Elukey: "Looks good: https://puppet-compiler.wmflabs.org/compiler1002/12792/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/464816 (owner: 10Elukey) [14:36:52] (03PS3) 10Elukey: Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 [14:38:00] (03CR) 10Elukey: [C: 032] Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 (owner: 10Elukey) [14:46:40] (03PS5) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [14:47:06] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:52:28] Hello everyone, per discussion in https://phabricator.wikimedia.org/T163274 the CI job that runs tests for the JsonConfig extension needs some changes, but I don't know how to write a patch for that or even if that's something I am able to do. How can I proceed? 
[14:53:01] Sorry if that's not the right channel to ask that, I just couldn't find more information on mediawiki and wikitech pages related to jenkins [14:55:45] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:56:27] (03PS2) 10Banyek: wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:56:34] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:56:54] !log depooling labsdb1011 [14:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:07] !log depooling labsdb1011 (T195747) [14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:11] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [15:03:37] 10Operations, 10Performance-Team, 10HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786 (10BPirkle) [15:05:37] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10elukey) ping :) [15:06:35] AndyRussG: o/ - not sure if you saw T203669, can you comment next week whenever you have time? So we can plan it accordingly :) [15:06:36] T203669: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 [15:07:39] elukey: hi! 
aaaaargh i did see it, many apologies for the delay in replying [15:08:10] (03CR) 10Cwhite: [C: 031] Remove Diamond from openldap/labs hosts [puppet] - 10https://gerrit.wikimedia.org/r/464822 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:09:08] elukey: the plan had been to start using the EventLogging stream at 100% client-side sample rate (on Fundraising banner campaigns) in time for the end-of-the year campaigns [15:09:14] (03CR) 10Cwhite: [C: 031] Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:09:16] however other stuff has come up and I'm not sure we'll get there [15:09:50] The EL stream is currently turned on globally at 1% client-side sample rate, so it should be currently usable to build the realtime feature [15:10:23] What I don't know is how much Fundraising online really needs this particular stream of realtime data [15:11:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:11:14] I'll comment on the task in a little while and ask dstrine (David, our PM) if he has more input during standup [15:11:20] thanks a lot :) [15:11:32] elukey: likewise thanks much and apologies again for the delay!!!! :D [15:11:57] AndyRussG: np! 
I was reviewing the pending tasks and annoying people with pings :) [15:15:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:17:42] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [15:18:41] (03PS4) 10Cwhite: openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) [15:19:34] (03PS1) 10Elukey: profile::prometheus::alerts: tune druid alerts [puppet] - 10https://gerrit.wikimedia.org/r/464828 [15:20:29] elukey: heheh pings much appreciated :) [15:20:36] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: tune druid alerts [puppet] - 10https://gerrit.wikimedia.org/r/464828 (owner: 10Elukey) [15:20:43] (03PS2) 10Elukey: profile::prometheus::alerts: tune druid alerts [puppet] - 10https://gerrit.wikimedia.org/r/464828 [15:24:45] (03PS1) 10Cwhite: ntp: remove diamond::collector in favor of prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) [15:26:07] (03CR) 10Cwhite: [C: 032] ntp: remove diamond::collector in favor of prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:27:14] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:33] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:33:25] (03CR) 10Muehlenhoff: "That looks wrong: These servers use ntp in server mode, i.e. 
acting as time servers, I don't think we currently collect the relevant metri" [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:34:44] (03CR) 10Ayounsi: [C: 032] Fix PTR for cr3-ulsfo<-->cr4-ulsfo link [dns] - 10https://gerrit.wikimedia.org/r/464068 (owner: 10Ayounsi) [15:34:48] (03PS2) 10Ayounsi: Fix PTR for cr3-ulsfo<-->cr4-ulsfo link [dns] - 10https://gerrit.wikimedia.org/r/464068 [15:35:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:38:53] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) It looks like there was another little flood on the 1st of October with requests being banned again: https://logstash.... 
[15:41:32] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:46:33] 10Operations: Allow directing users to PHP7 based on a cookie - https://phabricator.wikimedia.org/T206338 (10Joe) [15:46:55] 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10Joe) [15:52:22] (03PS1) 10Herron: admin: add turnilo and superset sudo privs to analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) [15:53:32] PROBLEM - Device not healthy -SMART- on db1072 is CRITICAL: cluster=mysql device=megaraid,10 instance=db1072:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops [15:54:12] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10herron) Is https://gerrit.wikimedia.org/r/464831 what you had in mind in terms of sudo privs? 
[15:55:47] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Resolve elasticsearch shard size alert by doing an in place reindex - https://phabricator.wikimedia.org/T204362 (10debt) 05Open>03Resolved [15:55:51] (03PS2) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [15:56:41] (03CR) 10jerkins-bot: [V: 04-1] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [15:56:51] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [15:57:33] (03PS1) 10Mforns: Add druid_load.pp to refinery jobs [puppet] - 10https://gerrit.wikimedia.org/r/464833 [15:58:10] (03PS3) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [15:58:19] (03CR) 10jerkins-bot: [V: 04-1] Add druid_load.pp to refinery jobs [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [15:58:36] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750 (10herron) 05Open>03Resolved The RBL check that was causing 403s for subscription attempts from IPs listed on spam blacklists was reve... [16:00:15] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10herron) 05Open>03Resolved a:03herron The rbl check that was causing the 403 error when attempting to subscribe from an IP in a spam blocklist was reverted this... 
[16:03:40] 10Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10Joe) [16:04:35] (03CR) 10Ottomata: [C: 031] admin: add turnilo and superset sudo privs to analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: 10Herron) [16:06:04] (03PS1) 10Elukey: Decommission bohrium [dns] - 10https://gerrit.wikimedia.org/r/464836 (https://phabricator.wikimedia.org/T206315) [16:06:09] (03CR) 10Mforns: [C: 04-1] "Please, do not merge yet! Just for reference thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [16:06:25] (03CR) 10Ottomata: Add druid_load.pp to refinery jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [16:06:27] (03CR) 10Cwhite: [C: 032] "> That looks wrong: These servers use ntp in server mode, i.e. acting" [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:06:29] (03CR) 10Elukey: [C: 031] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: 10Herron) [16:07:42] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Raise alert level on disk space for old elasticsearch servers - https://phabricator.wikimedia.org/T204361 (10debt) 05Open>03Resolved [16:09:32] PROBLEM - High lag on wdqs1010 is CRITICAL: 1.719e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:09:57] 10Operations: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) [16:10:42] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 37 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:15:27] (03PS1) 10Bstorm: labstore: we really only want to know about prolonged load issues [puppet] - 10https://gerrit.wikimedia.org/r/464838 (https://phabricator.wikimedia.org/T206144) [16:16:01] PROBLEM - High lag on wdqs1010 is CRITICAL: 1.382e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:16:43] downtime expired on wdqs1010, I'll add some... 
[16:22:32] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 30 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:28:38] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Cmjohnson) @akosiaris I found a spare 4TB SAS disk...replacing it now [16:31:01] (03CR) 10Elukey: [C: 032] Decommission bohrium [dns] - 10https://gerrit.wikimedia.org/r/464836 (https://phabricator.wikimedia.org/T206315) (owner: 10Elukey) [16:32:00] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Cmjohnson) Failed disk has been swapped out [16:34:27] (03PS1) 10Cwhite: hiera: enable ntp collector on labcontrol to replace ntpd diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/464843 (https://phabricator.wikimedia.org/T183454) [16:35:26] (03CR) 10GTirloni: [C: 032] labstore: we really only want to know about prolonged load issues [puppet] - 10https://gerrit.wikimedia.org/r/464838 (https://phabricator.wikimedia.org/T206144) (owner: 10Bstorm) [16:35:32] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Cmjohnson) Failed disk has been swapped out [16:35:42] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) >>! In T203674#4645494, @gerritbot wrote: > Change 464821 had a related patch set uploaded (by Banyek; owner: Banyek): > [operations/pu... 
[16:37:57] (03CR) 10Ottomata: Add druid_load.pp to refinery jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [16:38:26] 10Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10BBlack) Strawperson proposal from IRC, in pseudocode for cache_text, assuming the magic cookie is `f2b31d03ab7`: ``` sub recv_from_client_at_front_edge() { unset req.http.x-use-engine; if req.http.Co... [16:38:49] 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10BBlack) [16:38:56] (03CR) 10Cwhite: [C: 032] "> That looks wrong: These servers use ntp in server mode, i.e. acting" [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:39:57] (03CR) 10Cwhite: "Changes look expected: https://puppet-compiler.wmflabs.org/compiler1002/12793/" [puppet] - 10https://gerrit.wikimedia.org/r/464843 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:40:12] 10Operations, 10ops-eqiad, 10DBA: db1064 has disk smart error - https://phabricator.wikimedia.org/T206245 (10Cmjohnson) Swapped the failed disk [16:42:30] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [16:44:04] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10colewhite) 05Open>03Resolved [16:53:42] RECOVERY - Device not healthy -SMART- on db1072 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops [17:00:42] 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [17:02:03] 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) I've revised the checklist based on chat with @Joe. In particular, we want to start earlier with the verifica... [17:04:52] PROBLEM - MegaRAID on db1064 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:04:54] ACKNOWLEDGEMENT - MegaRAID on db1064 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206345 [17:04:58] 10Operations, 10ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T206345 (10ops-monitoring-bot) [17:06:21] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:07:12] (03PS1) 10Elukey: role::configcluster: set codfw to read/write and eqiad to readonly [puppet] - 10https://gerrit.wikimedia.org/r/464847 (https://phabricator.wikimedia.org/T205814) [17:07:41] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:08:31] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:08:42] <_joe_> XioNoX: ^^ [17:08:49] <_joe_> expected/known? 
[17:09:02] nop, looking [17:09:17] (03CR) 10Giuseppe Lavagetto: [C: 031] role::configcluster: set codfw to read/write and eqiad to readonly [puppet] - 10https://gerrit.wikimedia.org/r/464847 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [17:09:31] (03CR) 10Elukey: [C: 032] role::configcluster: set codfw to read/write and eqiad to readonly [puppet] - 10https://gerrit.wikimedia.org/r/464847 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [17:09:51] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:09:59] Last flapped : 2018-10-05 17:09:11 UTC (00:00:41 ago) [17:10:09] _joe_ merge and run puppet in codfw gently ok? [17:10:46] <_joe_> elukey: not gently [17:10:50] <_joe_> but yes, go on [17:10:55] seems like a (brief) issue with Equinix's OOB link, low priority, but I'll keep an eye on it, thanks [17:11:41] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 4.41 ms [17:12:31] !log set etcd in codfw as read/write (was readonly) and eqiad as readonly (was read/write) [17:12:32] <_joe_> elukey: things are ok now [17:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:42] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10herron) 05Open>03Resolved a:03herron Hi @Shahadat, the requested list has been created and additional details with credentials should have been emailed to y... 
[17:13:42] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.91 ms [17:13:51] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Create a mailling list for Wiki Loves Love - https://phabricator.wikimedia.org/T203792 (10herron) 05Open>03Resolved [17:17:37] _joe_ puppet ran across all the conf* [17:18:16] <_joe_> elukey: I successfully depooled/pooled an appserver, I think we're ok [17:18:53] super [17:23:18] (03PS2) 10Dzahn: icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 [17:23:37] 10Operations, 10Wikimedia-Mailing-lists: Request new mail list for Vietnam Wikimedians User Group - https://phabricator.wikimedia.org/T204974 (10herron) 05Open>03Resolved a:03herron Hi @minhhuy, the requested list has been created and additional details with credentials should have been emailed to you di... [17:26:03] (03CR) 10Dzahn: "wut? compiler says syntax error in naggen.pp ? not even touching this" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [17:26:19] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12794/einsteinium.wikimedia.org/change.einsteinium.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [17:26:42] RECOVERY - Device not healthy -SMART- on db1073 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [17:29:14] (03PS1) 10Dzahn: add blank passwords::network::snmp_ro_community [labs/private] - 10https://gerrit.wikimedia.org/r/464851 [17:29:38] (03PS2) 10Dzahn: add blank passwords::network::snmp_ro_community [labs/private] - 10https://gerrit.wikimedia.org/r/464851 [17:31:43] (03CR) 10Cwhite: [C: 031] "> wut? compiler says syntax error in naggen.pp ? 
not even touching" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [17:32:52] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:33:42] RECOVERY - Device not healthy -SMART- on db1064 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:40:31] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Create mailing list for Bureaucrat of zh.wikipedia - https://phabricator.wikimedia.org/T202435 (10herron) 05Open>03Resolved a:03herron Hi @Wong128hk, the requested list has been created and additional details with credentials should have been em... [17:40:32] (03PS3) 10Dzahn: add blank passwords::network::snmp_ro_community [labs/private] - 10https://gerrit.wikimedia.org/r/464851 [17:40:59] (03CR) 10Dzahn: [V: 032 C: 032] "labs only" [labs/private] - 10https://gerrit.wikimedia.org/r/464851 (owner: 10Dzahn) [17:42:35] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Krinkle) [17:45:11] RECOVERY - MegaRAID on db1064 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:45:39] (03PS1) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) [17:47:35] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 [17:48:59] (03CR) 10Aaron Schulz: [C: 031] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [17:49:48] (03CR) 
10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 (owner: 10Bstorm) [17:50:37] !log repooling labsdb1011 (T195747) [17:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:42] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [17:51:02] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 (owner: 10Bstorm) [17:51:07] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 (owner: 10Bstorm) [17:55:34] (03PS1) 10Andrew Bogott: host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) [17:55:36] (03PS1) 10Andrew Bogott: labvirt/cloudvirt hosts: only page when a host is down [puppet] - 10https://gerrit.wikimedia.org/r/464857 (https://phabricator.wikimedia.org/T206224) [17:56:20] (03PS1) 10Bstorm: wiki replicas: depool labsdb1009 for actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464859 (https://phabricator.wikimedia.org/T195747) [17:56:30] (03CR) 10jerkins-bot: [V: 04-1] host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [17:58:16] !log depooling labsdb1009 (T195747) [17:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:20] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [17:58:46] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1009 for actor table changes to views [puppet] - 
10https://gerrit.wikimedia.org/r/464859 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [18:01:41] (03PS2) 10Andrew Bogott: host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) [18:01:43] (03PS2) 10Andrew Bogott: labvirt/cloudvirt hosts: only page when a host is down [puppet] - 10https://gerrit.wikimedia.org/r/464857 (https://phabricator.wikimedia.org/T206224) [18:02:35] (03CR) 10jerkins-bot: [V: 04-1] host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [18:07:41] (03PS3) 10Andrew Bogott: host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) [18:07:42] !log disabling puppet on icinga for 5 min for extra safety before a change that should be noop [18:07:43] (03PS3) 10Andrew Bogott: labvirt/cloudvirt hosts: only page when a host is down [puppet] - 10https://gerrit.wikimedia.org/r/464857 (https://phabricator.wikimedia.org/T206224) [18:08:08] (03CR) 10Dzahn: [C: 032] icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [18:08:18] (03PS3) 10Dzahn: icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 [18:08:30] !log disabling puppet on icinga for 5 min for extra safety before a change that should be noop [18:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:44] (03CR) 10jerkins-bot: [V: 04-1] host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [18:21:58] (03PS2) 10BBlack: Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 [18:22:11] (03CR)
10jerkins-bot: [V: 04-1] Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [18:23:17] (03PS1) 10BBlack: authdns: add interface::rps and TFO [puppet] - 10https://gerrit.wikimedia.org/r/464861 [18:23:19] (03PS1) 10BBlack: gdnsd config: update for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/464862 [18:23:56] (03CR) 10jerkins-bot: [V: 04-1] authdns: add interface::rps and TFO [puppet] - 10https://gerrit.wikimedia.org/r/464861 (owner: 10BBlack) [18:26:27] !log icinga - noop on all servers, no change, puppet re-enabled, operations normal [18:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:56] (03PS2) 10BBlack: authdns: add interface::rps and TFO [puppet] - 10https://gerrit.wikimedia.org/r/464861 [18:28:58] (03PS2) 10BBlack: gdnsd config: update for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/464862 [18:31:17] !log gdnsd-2.99.9930-beta-1+wmf1 uploaded to stretch-wikimedia [18:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:21] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Smalyshev) > Is there a reason that all mediawiki hosts show as "localhost"? This is probably coming from Jetty, which takes it... 
[18:34:39] (03CR) 10Thcipriani: [C: 032] "Works as expected" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458226 (owner: 10Faidon Liambotis) [18:35:14] (03PS1) 10Cwhite: hiera: comment out diamond::remove [puppet] - 10https://gerrit.wikimedia.org/r/464863 (https://phabricator.wikimedia.org/T183454) [18:35:30] (03Merged) 10jenkins-bot: Don't barf on an empty or invalid YAML config [software/keyholder] - 10https://gerrit.wikimedia.org/r/458226 (owner: 10Faidon Liambotis) [18:35:35] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Smalyshev) Also, 1,182,961 events is a lot. What's going on there? Why so many? Is it a legit scenario? I wonder also if the most... [18:36:37] (03PS1) 10Cwhite: Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 [18:37:03] (03PS2) 10Cwhite: Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 [18:37:05] !log authdns2001: upgraded gdnsd to 2.99.9930-beta [18:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:26] (03CR) 10Muehlenhoff: [C: 031] Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 (owner: 10Cwhite) [18:37:58] (03CR) 10Cwhite: [C: 032] Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 (owner: 10Cwhite) [18:38:33] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10jayantanth) I dont understand why my mail ID is not added to admin list. 
[18:38:45] (03CR) 10Cwhite: [C: 032] hiera: comment out diamond::remove [puppet] - 10https://gerrit.wikimedia.org/r/464863 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [18:39:18] (03PS10) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [18:42:25] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [18:51:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:52:57] (03CR) 10Dzahn: [C: 04-1] "it would disable the current crons in codfw, the compiler output says "noop" but because it does the reverse what would happen in prod, be" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [18:54:21] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:54:30] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 [18:58:59] (03CR) 10Dzahn: [C: 032] "yea, none of that supposed issue with naggen.pp happened. 
this was a clean noop as expected, on all production servers" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [18:59:34] (03PS2) 10Bstorm: labstore: we really only want to know about prolonged load issues [puppet] - 10https://gerrit.wikimedia.org/r/464838 (https://phabricator.wikimedia.org/T206144) [19:02:12] (03PS1) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) [19:04:14] (03PS2) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) [19:06:26] (03CR) 10Dzahn: [C: 031] "yep, planned for Monday morning though, avoiding Friday" [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:10:15] (03PS1) 10Cwhite: Revert "hiera: comment out diamond::remove" [puppet] - 10https://gerrit.wikimedia.org/r/464867 [19:10:48] (03PS2) 10Cwhite: Revert "hiera: comment out diamond::remove" [puppet] - 10https://gerrit.wikimedia.org/r/464867 [19:10:56] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Opened Juniper case 2018-1005-0549 about the ND issue. 
[19:23:42] (03CR) 10Gehel: [C: 04-1] prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [19:30:51] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.002 second response time [19:33:50] (03PS11) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [19:33:58] gehel: ^ known or should we restart blazegraph? (the other day i talked to stas about doing that if wdqs has the issue again) [19:34:26] mutante: there's something wrong with setup on 1009, I am looking into it [19:34:40] SMalyshev: cool:) thanks [19:39:35] (03CR) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [19:42:02] (03CR) 10Smalyshev: [C: 04-1] wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [19:42:40] mutante: ok, I know what the problem is - it got old binaries deployed.
I will fix it [19:44:10] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@f8776de]: Redeploy 1009 [19:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:13] SMalyshev: great :) [19:44:36] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@f8776de]: Redeploy 1009 (duration: 00m 26s) [19:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:02] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.062 second response time [19:45:17] nice [19:46:23] turns out scap git handling is a bit trickier than I thought... so that autodeploy patch probably needs some work. but it should be back to normal now [19:48:10] !log repooling labsdb1009 (T195747) [19:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:15] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [19:48:38] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 (owner: 10Bstorm) [19:49:03] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [19:49:17] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 (owner: 10Bstorm) [19:49:20] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 (owner: 10Bstorm) [19:50:43] mutante, SMalyshev: thanks for taking care of that! 
I was mostly off already [19:51:08] (03CR) 10Smalyshev: [C: 031] wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [19:52:19] mutante: you can also ping onimisionipe on wdqs issues, he should have some idea about what's going on :) [19:54:52] (03CR) 10Gehel: [C: 04-1] prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [19:55:01] gehel: oh, of course! yes, the new wdqs-roots group even. will do. i basically picked you randomly [19:55:47] mutante: I'm not complaining (well, almost not)! It's just a good idea to give Matt some more exposure as well :) [19:56:00] *nod* yep [19:57:28] (03CR) 10Dzahn: [C: 032] "compiler output is the opposite of production because active_dc is set to eqiad in labs https://puppet-compiler.wmflabs.org/compiler1002/1" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [19:57:43] (03PS12) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [20:03:26] (03CR) 10Gehel: [C: 04-1] wdqs: auto deployment of wdqs on wdqs1009 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:04:27] (03CR) 10Thcipriani: [C: 032] Drop legacy SSHv1 support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 (owner: 10Faidon Liambotis) [20:05:04] (03CR) 10Smalyshev: [C: 031] wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:05:08] (03CR) 10Smalyshev: wdqs: auto 
deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:05:18] (03Merged) 10jenkins-bot: Drop legacy SSHv1 support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 (owner: 10Faidon Liambotis) [20:07:14] (03CR) 10Dzahn: [C: 032] "double checked on all. no change. before and after this the crons are running on mwmaint2001 and don't exist on either mwmaint1001 or mwma" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [20:08:03] * banyek off [20:09:02] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) This server will become active when we switch Mediawiki back to eqiad on October 10th. [20:10:37] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [20:14:57] (03CR) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [20:19:04] 10Operations, 10Patch-For-Review: releases servers: set rsync direction based on active dc, add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn) 05Open>03Resolved closing this. 2/3 things are done and the remaining one deserves a more broad solution. it is not specific... 
[20:19:48] 10Operations, 10Patch-For-Review: releases servers: add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn)
[20:24:56] (03PS4) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454)
[20:27:42] (03PS18) 10Dzahn: Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[20:33:25] (03CR) 10Dzahn: [C: 032] Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[20:34:16] (03PS1) 10Paladox: Gerrit: Lower gerrit.wmfusercontent.org priority in apache [puppet] - 10https://gerrit.wikimedia.org/r/464877
[20:34:37] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12799/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[20:35:22] (03PS2) 10Paladox: Gerrit: Lower gerrit.wmfusercontent.org priority in apache [puppet] - 10https://gerrit.wikimedia.org/r/464877
[20:36:43] (03PS3) 10Dzahn: Gerrit: Lower gerrit.wmfusercontent.org priority in apache [puppet] - 10https://gerrit.wikimedia.org/r/464877 (owner: 10Paladox)
[20:37:32] (03CR) 10Dzahn: [C: 032] "yep, thanks. per IRC, it would have also been second to load without this but only because "wiki" is before "wmf" in the alphabet. this is" [puppet] - 10https://gerrit.wikimedia.org/r/464877 (owner: 10Paladox)
[20:44:16] !log gerrit - adding gerrit.wmfusercontent.org virtual host for avatars. applied first on gerrit2001, then on cobalt (T191183)
[20:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:22] T191183: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183
[20:48:52] !log T191183 - it's still showing the error page as before but that isn't due to apache issues, it just needs additional ferm rules
[20:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:30] (03PS1) 10Rxy: Remove the "reviewer" group at ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464890 (https://phabricator.wikimedia.org/T205997)
[21:01:48] (03PS1) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454)
[21:02:38] (03Abandoned) 10Cwhite: hiera: enable ntp collector on labcontrol to replace ntpd diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/464843 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[21:10:24] (03PS1) 10Paladox: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907
[21:10:52] (03PS2) 10Paladox: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907
[21:12:02] (03PS3) 10Paladox: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907
[21:15:50] (03CR) 10Dzahn: [C: 031] "[cp1079:~] $ curl -H "Host: gerrit.wmfusercontent.org" http://cobalt.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/464907 (owner: 10Paladox)
[21:20:20] (03CR) 10Dzahn: [C: 031] "yes, the gerrit server has 2 IPs, server (208.80.154.81, cobalt) and service (208.80.154.85, gerrit) on the same interface and 2 Apache vi" [puppet] - 10https://gerrit.wikimedia.org/r/464907 (owner: 10Paladox)
[21:22:31] (03PS3) 10Thcipriani: Drop MD5 (pre-6.8) digest support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458228 (owner: 10Faidon Liambotis)
[21:26:00] (03PS4) 10Dzahn: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[21:27:16] (03PS5) 10Dzahn: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[21:30:25] paladox: i think it's still confusing because we are both making it sound as if Gerrit itself ever was or is behind caches
[21:30:36] paladox: but yea.. it's still the correct title
[21:30:37] (03CR) 10Thcipriani: [C: 032] Drop MD5 (pre-6.8) digest support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458228 (owner: 10Faidon Liambotis)
[21:30:41] yep
[21:30:57] should have called the director "avatars" :p
[21:31:02] lol
[21:31:05] not also gerrit
[21:31:20] (03Merged) 10jenkins-bot: Drop MD5 (pre-6.8) digest support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458228 (owner: 10Faidon Liambotis)
[21:31:37] well, i edited the commit message a bunch to explain
[21:31:42] :)
[21:35:33] (03PS5) 10Dzahn: Gerrit: fix login screen css [puppet] - 10https://gerrit.wikimedia.org/r/464418 (owner: 10Thcipriani)
[21:36:06] (03CR) 10Dzahn: [C: 032] Gerrit: fix login screen css [puppet] - 10https://gerrit.wikimedia.org/r/464418 (owner: 10Thcipriani)
[21:39:01] (03CR) 10Dzahn: [C: 032] "confirmed now the new text is not on the side anymore but below and centered. nice" [puppet] - 10https://gerrit.wikimedia.org/r/464418 (owner: 10Thcipriani)
[21:45:11] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dzahn) Yea, it's not ferm, it's the wrong backend IP per change above.
[21:49:51] (03PS1) 10Cwhite: nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454)
[22:01:06] 10Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) re: partman recipes rdb100[1-4]) echo partman/mw.cfg ;; \ rdb100[5-6]) echo partman/raid1-lvm-ext4-srv-noswap.cfg ;; \ rdb100[7-9]|rdb1010) echo partman/raid1-lvm-ext4-srv.cfg ;; \...
[22:02:33] (03PS1) 10Cwhite: nutcracker: set diamond::remove on all roles containing nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/464918 (https://phabricator.wikimedia.org/T183454)
[22:03:48] (03PS2) 10Cwhite: nutcracker: set diamond::remove on all roles containing nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/464918 (https://phabricator.wikimedia.org/T183454)
[22:04:37] (03PS1) 10Dzahn: partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442)
[22:04:41] (03PS2) 10Cwhite: nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454)
[22:05:42] (03PS2) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454)
[22:09:23] (03CR) 10RobH: [C: 031] partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442) (owner: 10Dzahn)
[22:09:55] (03PS2) 10Dzahn: partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442)
[22:10:31] (03CR) 10Dzahn: [C: 032] partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442) (owner: 10Dzahn)
[22:13:14] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562 (10Dzahn)
[22:13:19] 10Operations, 10Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) 05Open>03Resolved >>! In T140442#4637702, @faidon wrote: > As far as this task goes, I'd recommend fixing the partman recipe on Puppet (so that the next install gets it right) and resolvin...
[22:32:39] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Aklapper) I guess because the address was not listed in the task description and the original task author has not seconded the request. Note that any of the curre...
[22:37:05] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[22:37:08] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[22:38:04] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[22:52:56] (03CR) 10BBlack: [C: 032] Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[22:53:05] (03PS6) 10BBlack: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[22:56:59] XioNoX: still around? dallas port utilization stuff above ^
[22:57:27] (also, I guess we didn't repool ulsfo yet, probably should've earlier!)
[22:58:21] (03PS1) 10BBlack: Revert "Depool ulsfo for DC move" [dns] - 10https://gerrit.wikimedia.org/r/464922
[22:58:30] bblack thanks! Works https://gerrit.wmfusercontent.org/paladox.png
[22:58:36] cc mutante and thcipriani ^^
[22:59:19] and yeah, I think they are getting default cache TTLs of 86400
[22:59:32] that may be good or bad depending on your POV, as nothing will explicitly purge these on change I doubt :)
[22:59:45] yup
[22:59:58] we are using https://gerrit.wikimedia.org/r/admin/projects/All-Avatars for avatars
[23:01:31] bblack: on my way to a festival, it's possible that Facebook or similar changed something
[23:01:51] last time it was a brief spike
[23:06:23] yeah I'm actually seeing a depression in remote dns caches preferring dallas, too
[23:06:28] I'm checking the IX portal to figure out who that is
[23:06:33] which may indicate they're seeing latency increase in dallas, or loss
[23:07:05] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[23:07:08] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[23:07:11] XioNoX: we're still ok with the basic plan of repooling ulsfo right? because that may alleviate some ulsfo load too
[23:07:20] can we look at the logs to see if there is an increase of requests?
[23:07:22] err, alleviate some dallas load
[23:07:31] yeah, we can, but haven't yet
[23:07:38] yeah, ulsfo is ready to be repooled
[23:07:48] but keep an eye in case
[23:08:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[23:08:17] yes, there's an increase in requests, and it's a curious one as it's almost all HEAD reqs
[23:08:52] to cache_upload even
[23:08:59] it's too soon for the data to appear in the IX dashboard
[23:09:04] see here:
[23:09:06] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&from=now-1h&to=now&var-site=codfw&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
[23:09:09] https://ix.equinix.com/ixp/trafficStats
[23:09:19] gonna push ulsfo first, then look at that
[23:09:37] (03CR) 10BBlack: [C: 032] Revert "Depool ulsfo for DC move" [dns] - 10https://gerrit.wikimedia.org/r/464922 (owner: 10BBlack)
[23:10:15] https://librenms.wikimedia.org/graphs/to=1538780700/id=16721/type=port_bits/from=1538694300/
[23:10:46] if we disable the IX port the spike will impact one of our uplinks...
[23:11:02] right
[23:11:17] so no easy way to get out other than banning those specific source IPs after identifying them
[23:11:19] I'm surprised nothing else is alerting on this, really, but I guess it can make sense
[23:13:07] I'm assuming with that flattening at ~9.5G we're basically fully saturated
[23:14:21] yeah the ix port is at full steam
[23:14:40] it's facebook traffic heh
[23:14:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:15:03] with new UAs: Go-http-client/2.0 + Go-http-client/1.1
[23:15:37] the client IPs are all over, but they all put FB's standard identifier stuff in the lower 64
[23:16:30] "2a03:2880:ff:2::face:b00c"
[23:16:37] and similar with the :2: changing variously
[23:16:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:17:12] who at fb does a prod change Friday evening...
[23:17:36] yeah go figure
[23:17:49] this seems eerily familiar to the incident like this we had with alexa :P
[23:18:16] ulsfo is starting to offload some of the traffic, but I donno if it's enough to really fix things
[23:19:03] another thing we could do, is repool eqiad's front edge too
[23:19:26] (the reqs would still flow eqiad->codfw before reaching into the applayer, and thus still respect the DC switch in that sense, but it would shift edge link traffic around)
[23:19:35] bblack: set a larger cache value for this specific new UA?
[23:19:53] no idea if that will even help, or if they're even sending duplicate reqs
[23:19:59] but it's HEAD requests anyways, not GET
[23:20:06] so probably it wouldn't help :)
[23:21:17] rate limit is possible?
[23:21:32] at the cache layer
[23:21:51] maybe
[23:21:59] it's going to break something of course, somewhere
[23:22:06] sample of what the reqs look like in oxygen logs:
[23:22:09] {"hostname":"cp2026.codfw.wmnet","sequence":2823883557,"dt":"2018-10-05T23:12:33","time_firstbyte":0.000425,"ip":"2a03:2880:ff:b::face:b00c","cache_status":"hit-local","http_status":"404","response_size":0,"http_method":"HEAD","uri_host":"upload.wikimedia.org","uri_path":"/wikipedia/en/1/14/Iliana_Iotova_-_Bulgarian_part-_Citizens%E2%80%99_Corner_debate-_With_or_without_Schengen","uri_query":"?_(26448779462).png","content_type":"text/html; charset=UTF-8","referer":"http://upload.wikimedia.org/wikipedia/en/1/14/Iliana_Iotova_-_Bulgarian_part-_Citizens%E2%80%99_Corner_debate-_With_or_without_Schengen?_(26448779462).png","user_agent":"Go-http-client/2.0","accept_language":"-","x_analytics":"https=1;nocookies=1","range":"-","x_cache":"cp2011 hit/6, cp2026 miss"}
[23:22:21] {"hostname":"cp2017.codfw.wmnet","sequence":3016365346,"dt":"2018-10-05T23:12:33","time_firstbyte":0.000043,"ip":"2a03:2880:ff:a::face:b00c","cache_status":"int-front","http_status":"301","response_size":0,"http_method":"HEAD","uri_host":"upload.wikimedia.org","uri_path":"/wikipedia/commons/9/9f/%C3%83%8Dsis_Val%C3%A9ria_Gomes,_G%C3%B6teborg_Book_Fair_2014_2.png","uri_query":"","content_type":"-","referer":"-","user_agent":"Go-http-client/1.1","accept_language":"-","x_analytics":"-","range":"-","x_cache":"cp2017 int"}
[23:22:47] they don't even have response bodies as HEAD responses, we're just saturating at the high volume of metadata flowing heh
[23:23:10] oh wait, why is neither one of my random samples a 200? :)
[23:24:07] bblack@oxygen:~$ jq .http_status fb-headreqs |sort|uniq -c|sort -rn|head -20
[23:24:10] 5135 "301"
[23:24:13] 4426 "412"
[23:24:15] 373 "404"
[23:24:17] 292 "200"
[23:24:51] so the 412s are precondition failures, like If-None-Match mismatches or whatever, to check caching
[23:25:45] but the 301's... apparently they're hitting port 80 for those :P
[23:25:55] which is ~ half of the traffic
[23:26:21] ridiculous, but also I don't think 429 ratelimiters will be much less impactful than the redirects or 412s
[23:26:59] of course, the HEAD reqs are the most-notable thing in the traffic window, but there's just no way they account for all that port saturation
[23:27:04] 04Critical Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[23:27:08] 04Critical Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[23:27:25] there's probably also GETs for images in the mix too, just at a much lower rate
[23:27:27] ah ^
[23:27:33] recoveries
[23:28:05] 04Critical Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[23:28:10] or in other words: we're observing the HEADs in reqstats because it's a high volume of reqs, but there's probably also a low volume of matching GET traffic for images that doesn't stand out in req counts, but is contributing significantly to bytes output
[23:29:05] so I guess someone is spinning up some software and testing it for brief periods
[23:29:09] yeah, on a Friday :P
[23:29:50] don't they have better things to do, like sell user data or create new bugs that will compromise millions of their user accounts or something?
[23:33:26] lol
[23:33:36] XioNoX: my net take on this is I think we should repool the eqiad front edge. We're only ~3 work days out from switchback anyways, and it will hopefully segment away some of the competing traffic to other links.
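The jq one-liner above tallies `http_status` values across the JSON-per-line webrequest sample on oxygen. The same tally can be sketched in Python; the records below are trimmed stand-ins shaped like the oxygen samples pasted earlier, not real data:

```python
import json
from collections import Counter

# Hypothetical JSON-per-line sample, fields trimmed to what the tally needs.
sample_lines = [
    '{"http_status":"404","http_method":"HEAD","user_agent":"Go-http-client/2.0"}',
    '{"http_status":"301","http_method":"HEAD","user_agent":"Go-http-client/1.1"}',
    '{"http_status":"301","http_method":"HEAD","user_agent":"Go-http-client/2.0"}',
]

def status_counts(lines):
    """Count http_status values, like `jq .http_status | sort | uniq -c | sort -rn`."""
    return Counter(json.loads(line)["http_status"] for line in lines)

counts = status_counts(sample_lines)
for status, n in counts.most_common():
    print(n, status)
```

In practice you would feed it the log file line by line (`open("fb-headreqs")`) instead of an in-memory list.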
[23:34:04] because odds are this probably wasn't the last burst, and blocking it gets tricky and breaks other things for a bunch of other FB stuff I assume [23:42:04] (03PS1) 10BBlack: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/464928 [23:43:51] anyways, the DNS stats are an interesting indicator to look at during this stuff. I wouldn't normally look for it, but happened to have them up lately: [23:43:54] https://grafana.wikimedia.org/dashboard/db/dns?orgId=1&from=now-6h&to=now [23:44:40] because a lot of the recursive caches that hit us, are smart enough that they keep some latency/loss history/probing to all 3, so when you see a notable shift of stats away from a DC, it probably means they're observing worse conditions there in general [23:45:00] (and in this case, the bump there in that graph, away from codfw to the other two, aligns with this codfw traffic saturation) [23:46:24] cool, yeah [23:47:03] I was going to say that I'm not expecting changes on a weekend but last spike was a Saturday I think [23:47:34] so repooling eqiad is maybe safer [23:48:52] (03CR) 10BBlack: [C: 032] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/464928 (owner: 10BBlack) [23:50:12] !log <<<<<<< repooling eqiad edge caches, a few days ahead of intended switchback next Weds, to alleviate some traffic engineering concerns over the weekend >>>>>> [23:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
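Earlier in the log, the crawler traffic was identified by Facebook's conventional ::face:b00c suffix in the low bits of otherwise-varying IPv6 client addresses. That observation can be turned into a small filter; this is a sketch, and both the helper name and the choice to match only the low 32 bits are assumptions here, not an official FB convention spec:

```python
import ipaddress

# Assumed marker: the low 32 bits spell 0xfaceb00c, as seen in the sampled IPs.
FACEB00C = 0xFACEB00C

def looks_like_fb_crawler(ip_str):
    """Return True if ip_str is an IPv6 address whose low 32 bits are face:b00c.

    Hypothetical helper mirroring the log's observation that client IPs
    varied in the upper bits but shared the ::face:b00c tail.
    """
    try:
        addr = ipaddress.ip_address(ip_str)
    except ValueError:
        return False  # not an IP address at all
    if addr.version != 6:
        return False
    return int(addr) & 0xFFFFFFFF == FACEB00C
```

For example, `looks_like_fb_crawler("2a03:2880:ff:b::face:b00c")` matches the sampled oxygen records, while an unrelated IPv6 address does not.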