[00:06:58] (03PS1) 10GTirloni: toollabs-golang - Update to Stretch and Go 1.10 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) [00:20:15] (03CR) 10Dzahn: "this need to be amended. the actual user name has only been created now and uid is uid: skvjold" [puppet] - 10https://gerrit.wikimedia.org/r/460943 (https://phabricator.wikimedia.org/T204377) (owner: 10Ayounsi) [00:24:43] (03PS1) 10Dzahn: admins: update ldap user name of Margeigh Novotny [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) [00:24:59] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/464740/" [puppet] - 10https://gerrit.wikimedia.org/r/460943 (https://phabricator.wikimedia.org/T204377) (owner: 10Ayounsi) [00:25:21] (03PS2) 10Dzahn: admins: update ldap user name of Margeigh Novotny [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) [00:26:38] (03CR) 10Dzahn: [C: 032] ""ldap_only" not a shell user" [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) (owner: 10Dzahn) [00:27:53] (03CR) 10Dzahn: [C: 032] ""mnovotny" doesn't exist in LDAP but this one does (now)" [puppet] - 10https://gerrit.wikimedia.org/r/464740 (https://phabricator.wikimedia.org/T204377) (owner: 10Dzahn) [00:31:29] !log LDAP: added user skvjold to group wmf (T204377) [00:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:33] T204377: LDAP Acess request for Margeigh Novotny - https://phabricator.wikimedia.org/T204377 [00:36:14] (03CR) 10BryanDavis: "formatting comment inline. 
Before merging and deploying there should be some check to see if we have folks actually using the go image and" (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:39:48] (03CR) 10Legoktm: "legoktm@tools-k8s-master-01:~$ ./pods.sh" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:42:09] (03CR) 10BryanDavis: "Added Dan and Tyler as reviewers since they are apparently the current golang tools maintainers. (Thanks legoktm!)" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:45:07] (03CR) 10GTirloni: toollabs-golang - Update to Stretch and Go 1.10 (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [00:46:49] (03PS2) 10GTirloni: toollabs-golang - Update to Stretch and Go 1.10 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) [00:52:24] PROBLEM - HHVM rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:23] RECOVERY - HHVM rendering on mw1324 is OK: HTTP OK: HTTP/1.1 200 OK - 80831 bytes in 0.181 second response time [01:43:43] PROBLEM - MD RAID on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:43:53] PROBLEM - mathoid endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:44:04] PROBLEM - SSH on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:04] PROBLEM - apertium apy on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:23] PROBLEM - cpjobqueue endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[01:44:33] PROBLEM - eventstreams on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:33] PROBLEM - changeprop endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:44:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: [01:44:43] /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [01:44:44] PROBLEM - pdfrender on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:53] PROBLEM - Check the NTP synchronisation status of timesyncd on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:04] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:13] PROBLEM - Check systemd state on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:14] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:45:24] PROBLEM - Disk space on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:45:33] RECOVERY - eventstreams on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 2.393 second response time [01:45:43] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [01:45:43] RECOVERY - pdfrender on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time [01:45:44] RECOVERY - MD RAID on scb2005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:46:03] RECOVERY - mathoid endpoints health on scb2005 is OK: All endpoints are healthy [01:46:14] RECOVERY - apertium apy on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.077 second response time [01:46:14] RECOVERY - SSH on scb2005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [01:46:14] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [01:46:15] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [01:46:24] RECOVERY - Disk space on scb2005 is OK: DISK OK [01:46:33] RECOVERY - cpjobqueue endpoints health on scb2005 is OK: All endpoints are healthy [01:46:43] RECOVERY - changeprop endpoints health on scb2005 is OK: All endpoints are healthy [01:47:14] RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy [02:14:44] RECOVERY - Check the NTP synchronisation status of timesyncd on scb2005 is OK: OK: synced at Fri 2018-10-05 02:14:41 UTC. 
[02:49:23] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [02:55:44] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:27:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 730.68 seconds [03:42:50] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 [03:43:02] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 [03:43:58] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 (owner: 10Andrew Bogott) [03:44:09] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 (owner: 10Andrew Bogott) [03:59:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 294.74 seconds [04:16:29] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/CirrusSearch/includes/DataSender.php: I0769c50c (duration: 01m 01s) [04:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:03] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/libs/filebackend/FileBackendStore.php: T205567 - I75f1eb6dc2cb (duration: 00m 56s) [04:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:08] T205567: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 [04:32:18] (03PS1) 10Varnent: Update for Governance Wiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 [04:38:15] (03PS2) 10Varnent: Update to sitename for Governance Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) [05:03:50] (03PS1) 10Varnent: Additional namespaces for Governance Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464752 (https://phabricator.wikimedia.org/T206173) [05:08:27] (03PS1) 10Marostegui: db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) [05:10:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [05:11:18] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad Marostegui T206254 - The acknowledgement expires at: 2018-10-10 05:10:51. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [05:11:30] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) [05:12:18] (03Merged) 10jenkins-bot: db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [05:13:05] (03PS3) 10Varnent: Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) [05:13:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1092 status - T205514 (duration: 00m 57s) [05:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:46] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [05:25:20] (03CR) 10jenkins-bot: db-eqiad.php: Clarify db1092 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464753 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [05:27:54] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [05:46:12] (03CR) 10Krinkle: [C: 032] Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) (owner: 10Varnent) [05:47:52] (03Merged) 10jenkins-bot: Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) (owner: 10Varnent) [05:49:26] varnent: If you're here, do you want to verify the change on staging? [05:51:48] It's a two-step process, I can walk you through it if you haven't done it before. 
[05:53:39] <_joe_> !log upgrading python-etcd on conf1004-6, restarting etcdmirror [05:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:17] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 (owner: 10Giuseppe Lavagetto) [05:54:27] (03CR) 10jenkins-bot: Update to sitename for Governance Wiki to reflect new name of site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464748 (https://phabricator.wikimedia.org/T205599) (owner: 10Varnent) [05:54:30] * Krinkle verified on mwdebug2002 [05:55:19] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T205599 - Ic28e00c30 (duration: 00m 57s) [05:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:26] T205599: Change $wgSitename for Governance Wiki - https://phabricator.wikimedia.org/T205599 [05:56:11] (03CR) 10Giuseppe Lavagetto: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [06:00:43] (03PS3) 10Giuseppe Lavagetto: wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) [06:01:01] <_joe_> I'll never wrap my head around Mediawiki's use of whitespace around parentheses [06:05:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You also need to change the records in template/wikimedia.org and anywhere else they might appear." 
[dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [06:05:48] (03CR) 10Krinkle: mediawiki::web::prod_sites: convert wiktionary.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:06:46] <_joe_> Krinkle: yeah, I found so many of these small WTFs in our virtual hosts [06:07:18] _joe_: so which one do we generate, or did you add support for the broken variant? [06:07:21] * Krinkle runs compiler to check [06:07:35] <_joe_> we fix it [06:07:48] <_joe_> the generated one will have the correct rewrite [06:07:57] cool, Yeah, I see it now [06:08:05] <_joe_> if we want to support that broken version (should we?) we need to add it manuall [06:08:08] <_joe_> *manually [06:08:36] No, it makes no sense. [06:08:38] <_joe_> that's one of the advantages of letting computers do the formatting, it's harder to make mistakes :P [06:08:51] Even if (big if) we really stored them there in the past, we don't currently, so they'd just be 404 [06:09:13] <_joe_> ack [06:09:28] _joe_: Hm.. retry=0 is being added. What's the default currently? 
[06:09:43] <_joe_> we planned to have retry=0 everywhere [06:10:07] <_joe_> we don't want apache to retry something that timed out on hhvm because of a bad db query for instance [06:10:24] <_joe_> the retries are done by varnish [06:10:49] <_joe_> so no point in retrying locally (and blindly, while varnish IIRC only retries GET requests) [06:11:56] Yeah that makes sense [06:12:07] But does that mean it currently retries where it doesn't say that [06:12:09] <_joe_> so wherever it was missing, it was because or.i and I had too many things to do around HHVM and forgot to standardize on it [06:12:15] or are we just stating the default for clarity [06:12:16] <_joe_> yes [06:12:26] <_joe_> I don't remember what's the default on apache 2.4.x [06:12:30] <_joe_> for mod_proxy [06:13:11] <_joe_> sorry, I got off track completely [06:13:26] <_joe_> what I said is for the proxy balancers [06:13:28] np. [06:13:30] Just one last thing [06:13:35] - RewriteRule ^/w/$ /w/index.php [06:13:44] <_joe_> in our case, "retry" is the time apache waits to repoll the backend [06:13:47] Looks like that one isn't +'ed again. [06:13:51] <_joe_> yes [06:14:02] <_joe_> because we have in the root configuration [06:14:09] <_joe_> DirectoryIndex index.php [06:14:18] Interesting [06:14:22] I guess that works [06:14:30] <_joe_> so if you request /w/, apache will try /w/index.php [06:14:38] what about http://en.wikipedia.org/w/?foo [06:14:42] <_joe_> I checked because half of our vhosts had that rewrite [06:14:48] <_joe_> half didn't [06:14:54] <_joe_> that should work [06:14:59] <_joe_> lemme test [06:15:32] <_joe_> interestingly, doesn't get redirected to /wiki/Main_Page [06:15:39] <_joe_> which is what we'd want, right? [06:15:55] <_joe_> while http://en.wikipedia.org/w/ does [06:15:55] No, it should get handled by MW the same way as /?
[06:16:07] <_joe_> yes, I'm talking about Mediawiki handling it [06:16:09] <_joe_> :) [06:16:22] plain /w/ redirects because MW decides to normalize the url given no query string to dictate otherwise [06:16:26] <_joe_> MW sends out a redirect for /w/ [06:16:28] <_joe_> oh I see [06:16:50] I'm guessing meta-wiki is on the new format because https://meta.wikimedia.org/w/?foo redirects and loses the query [06:17:13] <_joe_> https://en.wikivoyage.org/w/?foo works, and is in the new format [06:17:34] cool [06:17:45] <_joe_> I don't know about meta, but I see the url is localized upon redirect [06:17:48] <_joe_> lemme try via curl [06:18:35] nvm, I don't know why ?foo is lost on Meta-Wiki, but other queries are not [06:18:36] https://meta.wikimedia.org/w/?diff=18447142 [06:18:40] So I guess that's fine. [06:20:37] Interesting, the new rules make ENV:RW_PROTO optional for the /math/ redirect. Most of our new config just assumes it is set though, looks interesting, but fine either way [06:20:53] (03CR) 10Krinkle: [C: 031] "Looks like it fixes that, cool." [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:21:24] <_joe_> we always set it at the top of the file [06:21:43] + RewriteCond %{ENV:RW_PROTO} !="" [06:21:44] + RewriteRule ^/math/(.*) %{ENV:RW_PROTO}://upload.wikimedia.org/math/$1 [R=301] [06:21:44] + RewriteRule ^/math/(.*) https://upload.wikimedia.org/math/$1 [R=301] [06:21:54] <_joe_> oh right the conditional, yes [06:21:56] I assume you didn't introduce it, it just happens to be a variation you used as the basis [06:22:01] <_joe_> also, it's redundant :) [06:22:05] Yeah [06:22:07] <_joe_> it was the most common [06:22:10] Cool [06:22:36] <_joe_> ultimately we should, IMO, move all sites to redirect to https [06:22:48] <_joe_> and thus we can just remove RW_PROTO [06:22:50] Yeah, there's no point in keeping it variable everywhere.
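[Editor's note] The two RewriteRule variants quoted above differ only in whether they consult RW_PROTO, and since RW_PROTO is set at the top of every vhost file the conditional form is redundant, as _joe_ notes. A minimal Python sketch of that rewrite logic (illustrative only, not the actual Apache code; the helper name is made up):

```python
import re

def math_redirect(path, rw_proto=None):
    """Mimic the /math/ rewrite quoted in the log: the conditional
    variant uses RW_PROTO when it is non-empty, the unconditional
    variant hardcodes https. Returns the redirect target, or None
    when the rule does not match."""
    m = re.match(r"^/math/(.*)", path)
    if m is None:
        return None  # RewriteRule pattern did not apply
    proto = rw_proto if rw_proto else "https"
    return f"{proto}://upload.wikimedia.org/math/{m.group(1)}"

# With RW_PROTO always set to https (the end state once all sites
# redirect to https), both variants produce the same target:
assert math_redirect("/math/a/b.png", "https") == math_redirect("/math/a/b.png")
```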
[06:25:44] 10Operations, 10WMF-JobQueue, 10Core Platform Team Kanban (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [06:26:15] 10Operations, 10HHVM, 10User-ArielGlenn: Run all maintenance scripts on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393 (10Krinkle) [06:27:52] 10Operations, 10WMF-JobQueue, 10Core Platform Team Kanban (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) Thanks. I'll purpose it for maintenance hosts (CLI maintenance scripts from cron). For job runners, we us... [06:29:01] 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [06:31:03] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf] [06:31:44] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:31:44] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:33:14] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/confd-lint-wrap] [06:42:45] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) > would you have time to chat on IRC some time today / this week / next week (or the week after Let's... [06:45:55] (03CR) 10Krinkle: [C: 031] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [06:56:23] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:57:13] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:13] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:33] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:09:26] !log installing python3.4/2.7 security updates [07:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:02] (03Abandoned) 10Muehlenhoff: Create component/hhvm324 [puppet] - 10https://gerrit.wikimedia.org/r/439548 (owner: 10Muehlenhoff) [07:10:18] (03PS5) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [07:18:49] !log stopping s5 replication on db1070 [07:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:14] !log stopping s3 replication on db1075 [07:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:29] !log temporarily stop prometheus on bast4001 to finalize data transfer - T179050 [07:20:33] !log 
stopping x1 replication on db1069 [07:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:35] T179050: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 [07:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:41] !log stopping s3 replication on db1070 [07:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:46] (03PS1) 10Muehlenhoff: Add Cumin aliases for new eqiad1 roles [puppet] - 10https://gerrit.wikimedia.org/r/464762 [07:32:34] RECOVERY - Device not healthy -SMART- on bast4001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast4001&var-datasource=ulsfo%2520prometheus%252Fops [07:33:00] !log changing s3 master for db1070 [07:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:14] (03PS1) 10Ema: Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) [07:35:10] (03CR) 10jerkins-bot: [V: 04-1] Backport ATS 8.0.0 to stretch-wikimedia [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/464764 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [07:38:11] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10Volans) I checked the current code and I see that in the same try/except that raises that error it does also: ``` from napalm_base.exceptions import ConnectAuthError, ModuleImportError ``` and `n...
[07:38:15] (03CR) 10Elukey: "elukey@conf1005:~$ curl conf1005.eqiad.wmnet:8000/lag -i" [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [07:40:56] (03PS2) 10Volans: ircecho: log exception on exit [puppet] - 10https://gerrit.wikimedia.org/r/463749 (https://phabricator.wikimedia.org/T205522) [07:42:48] (03CR) 10Giuseppe Lavagetto: [C: 031] "Nevermind, brainfart." [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [07:45:26] (03PS4) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [07:46:52] this is me ^^^ deployed a code change, it should rejoin in few seconds [07:47:12] welcome back icinga-wm :) [07:47:39] (03PS5) 10Gehel: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [07:47:53] (03CR) 10Filippo Giunchedi: [C: 032] changing prometheus.svc.ulsfo.wmnet entry to bast4002 [dns] - 10https://gerrit.wikimedia.org/r/464369 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [07:48:47] (03Abandoned) 10Volans: Custom fields: fix field type [software/netbox] - 10https://gerrit.wikimedia.org/r/462860 (https://phabricator.wikimedia.org/T199083) (owner: 10Volans) [07:48:49] (03CR) 10Gehel: [C: 032] icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [07:50:50] !log stopping dbstore1001:x1 [07:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:03] PROBLEM - MariaDB Slave IO: s5 on db1070 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested 
to start from GTID 0-171966669-4075108480, which is not in the masters binlog. Since the masters binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing [07:52:03] transactions [07:54:22] !log starting replication on db1075; db1070, db1070:s3 with disabled gtid [07:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:26] PROBLEM - DPKG on bast3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:56:26] RECOVERY - DPKG on bast3002 is OK: All packages OK [07:58:06] RECOVERY - MariaDB Slave IO: s5 on db1070 is OK: OK slave_io_state Slave_IO_Running: Yes [08:00:40] (03PS2) 10Muehlenhoff: Add Cumin aliases for new eqiad1 roles [puppet] - 10https://gerrit.wikimedia.org/r/464762 [08:01:07] (03PS2) 10Elukey: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:01:22] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10fgiunchedi) Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Prometheus#Sync_data_from_an_existing_Prometheu...
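[Editor's note] The error 1236 above means db1070 asked its master for a GTID position the master never wrote, while the master's binlog already holds higher sequence numbers in that replication domain, so MariaDB concludes the slave has diverged; the operator worked around it by restarting replication with GTID disabled. A rough Python sketch of that divergence check (MariaDB GTIDs are `domain-server_id-sequence`; the function names are illustrative, not MariaDB internals):

```python
def parse_gtid(gtid):
    """Split a MariaDB GTID 'domain-server_id-sequence' into ints."""
    domain, server_id, seq = (int(x) for x in gtid.split("-"))
    return domain, server_id, seq

def diverged(slave_pos, master_binlog):
    """True when the slave's requested GTID is absent from the master's
    binlog even though the binlog already contains higher sequence
    numbers in the same domain -- the condition behind error 1236."""
    domain, _, seq = parse_gtid(slave_pos)
    seqs = [parse_gtid(g)[2] for g in master_binlog
            if parse_gtid(g)[0] == domain]
    return slave_pos not in master_binlog and any(s > seq for s in seqs)

# db1070 requested a sequence number the master never logged,
# sandwiched between ones it did log:
print(diverged("0-171966669-4075108480",
               ["0-171966669-4075108479", "0-171966669-4075108481"]))  # True
```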
[08:01:58] (03CR) 10Muehlenhoff: [C: 032] Add Cumin aliases for new eqiad1 roles [puppet] - 10https://gerrit.wikimedia.org/r/464762 (owner: 10Muehlenhoff) [08:02:29] (03PS6) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [08:02:50] !log start replication on db1069 (x1) [08:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:53] (03CR) 10Elukey: [C: 032] "There seems to be a temporary inconsistency in the lag reported causing the negative numbers, but we decided to go forward with this chang" [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:03:02] (03PS3) 10Elukey: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:03:09] ah! merge sniped :D [08:03:10] (03CR) 10Muehlenhoff: [C: 032] Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [08:03:56] (03PS4) 10Elukey: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:03:59] (03CR) 10Elukey: [V: 032 C: 032] profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 (owner: 10Giuseppe Lavagetto) [08:04:38] moritzm: feel free to merge mine :) [08:08:09] elukey: done! [08:08:14] thanks! 
[08:13:28] 10Operations, 10Services, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Add wdqs-updater to scap target in puppet - https://phabricator.wikimedia.org/T206303 (10Mathew.onipe) [08:16:06] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) labsaliaser is working now: ```krenair@bastion-eqiad1-01:~$ dig eqiad1.bastion.wmflabs.org @8.8.8.8 +short 185.15.56.13 krenair@bastion-eqiad... [08:16:27] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Services (doing): Add sudo rules for wdqs-updater in puppet - https://phabricator.wikimedia.org/T206303 (10mobrovac) a:03mobrovac [08:16:39] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Services (doing): Add sudo rules for wdqs-updater in puppet - https://phabricator.wikimedia.org/T206303 (10mobrovac) p:05Triage>03Normal [08:19:49] (03PS1) 10Mobrovac: WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) [08:20:24] (03CR) 10jerkins-bot: [V: 04-1] WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:21:40] (03PS3) 10Jcrespo: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) [08:22:12] (03PS2) 10Mobrovac: WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) [08:23:08] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: Add sudo rules for wdqs-updater in puppet - https://phabricator.wikimedia.org/T206303 (10Gehel) @mobrovac thanks for the fast response! 
I was wondering if we had a cleaner way to declare that a scap::targe... [08:26:11] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10MoritzMuehlenhoff) [08:27:21] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler1002/12779/" [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:28:59] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10MoritzMuehlenhoff) p:05Triage>03Normal [08:30:47] (03CR) 10Mathew.onipe: "Jenkins dry run:" [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:33:14] (03CR) 10Mathew.onipe: [C: 031] WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [08:35:56] <_joe_> !log reenabling notifications for etcdmirror on conf1005 [08:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:38] (03CR) 10Banyek: [C: 032] wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:37:50] (03CR) 10Banyek: [V: 032 C: 032] wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:39:12] (03CR) 10ArielGlenn: "I'll merge this through then when this week's wd dumps are complete.It's recompressing all-BETA-nt right now, which means there's still th" [puppet] - 10https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: 10Smalyshev) [08:44:34] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 
(https://phabricator.wikimedia.org/T135991) [08:45:24] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:46:31] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) [08:49:51] (03PS2) 10Elukey: Move _etcd._tcp* SRV records to etcd codfw [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) [08:52:18] <_joe_> elukey: I have the cache wipe ready, just for the RW records, which are more important [08:52:34] <_joe_> I'll also wipe the rest afterwards [08:52:47] <_joe_> so just tell me when you've distributed the change [08:52:59] <_joe_> I'll do the cache wipes, then we can verify [08:53:17] <_joe_> and we will start the long process of chasing down long-connected clients [08:53:24] <_joe_> to the old eqiad cluster [08:53:25] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10fgiunchedi) a:05fgiunchedi>03None >>! In T170817#4643270, @kaldari wrote: > @fgiunchedi - Are you still working on this or shou... 
[08:53:42] all right, merging then [08:53:49] (03CR) 10Elukey: [C: 032] Move _etcd._tcp* SRV records to etcd codfw [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [08:54:38] _joe_ authdns updated [08:54:46] <_joe_> elukey: ok, wiping caches now [08:55:06] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Banyek) I'd happily get involved in this [08:55:36] for everybody reading - we are moving etcd srv records from conf100[1-3] to codfw [08:55:46] probably better logging it [08:55:46] <_joe_> elukey: done [08:56:02] <_joe_> !log read-write connections to etcd only go to codfw now [08:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:15] good [08:56:23] so now we chase down the long lived conns [08:56:24] right? [08:56:34] <_joe_> elukey: while I keep cleaning caches, would you stop replication eqiad => codfw? 
[08:56:37] <_joe_> via puppet [08:56:40] ack [08:56:44] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1315 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:56:54] <_joe_> oh right, meh [08:57:05] <_joe_> hehe we might get a ton of those now [08:57:08] <_joe_> sorry [08:57:14] <_joe_> but it's not true [08:57:20] <_joe_> it's just different clusters [08:57:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1224 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:44] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1348 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1331 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:54] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1272 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:57:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1315 is OK: etcd last index (302926) matches the master one (302926) [08:57:56] so no outage, right? 
[08:58:02] _joe_: mmmh in theory just few of them [08:58:11] given the index on einsteinium is updated every 30s IIRC [08:58:25] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1247 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:34] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1343 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:34] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1249 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1288 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1324 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1289 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:35] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1281 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:40] <_joe_> !log wiped cached values for the read-only etcd SRV record [08:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1245 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1228 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:48] <_joe_> volans: well a few is still a lot [08:58:51] <_joe_> and yes, no outage [08:58:54] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1332 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:54] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1269 is CRITICAL: etcd last index (212133) is outdated compared to the master 
one (302926) [08:58:55] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1278 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:55] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1272 is OK: etcd last index (302926) matches the master one (302926) [08:58:55] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1287 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:58:57] they seem too many to me [08:58:58] <_joe_> sorry for the noise [08:59:00] checking the check [08:59:04] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1339 is CRITICAL: etcd last index (212133) is outdated compared to the master one (302926) [08:59:19] <_joe_> volans: all mediawikis still reading from eqiad will have that error [08:59:27] <_joe_> einsteinium is updated [08:59:33] _joe_: the check does dig +short SRV "_etcd._tcp.${dc}.wmnet" [08:59:35] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1247 is OK: etcd last index (302926) matches the master one (302926) [08:59:44] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1343 is OK: etcd last index (302926) matches the master one (302926) [08:59:44] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1249 is OK: etcd last index (302926) matches the master one (302926) [08:59:44] on einsteinium [08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1288 is OK: etcd last index (302926) matches the master one (302926) [08:59:45] <_joe_> see the recoveries? 
[08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1324 is OK: etcd last index (302926) matches the master one (302926) [08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1289 is OK: etcd last index (302926) matches the master one (302926) [08:59:45] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1281 is OK: etcd last index (302926) matches the master one (302926) [08:59:51] yeah ok, [08:59:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1224 is OK: etcd last index (302926) matches the master one (302926) [08:59:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1245 is OK: etcd last index (302926) matches the master one (302926) [08:59:55] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1228 is OK: etcd last index (302926) matches the master one (302926) [08:59:55] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1348 is OK: etcd last index (302926) matches the master one (302926) [09:00:03] <_joe_> it's expected, it was just a few of them [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1331 is OK: etcd last index (302926) matches the master one (302926) [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1332 is OK: etcd last index (302926) matches the master one (302926) [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1269 is OK: etcd last index (302926) matches the master one (302926) [09:00:04] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1278 is OK: etcd last index (302926) matches the master one (302926) [09:00:05] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1287 is OK: etcd last index (302926) matches the master one (302926) [09:00:14] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1339 is OK: etcd last index (302926) matches the master one (302926) [09:00:15] <_joe_> actually this proves the check works well :P [09:00:30] (03PS1) 10Elukey: Stop replicating etcd data from conf100[1-3] to codfw [puppet] - 10https://gerrit.wikimedia.org/r/464772 (https://phabricator.wikimedia.org/T205814) [09:00:36] 
there you go --^ [09:00:38] ahahah [09:00:48] <_joe_> elukey: downtime conf2002, merge and apply [09:00:52] ack [09:01:02] (03CR) 10Giuseppe Lavagetto: [C: 032] Stop replicating etcd data from conf100[1-3] to codfw [puppet] - 10https://gerrit.wikimedia.org/r/464772 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [09:02:18] downtimed [09:02:31] <_joe_> ok [09:02:33] merging and applying [09:02:46] <_joe_> cool tnx [09:04:24] <_joe_> elukey: I'll now restart confd on a server in esams to check the connections go away from conf1001 [09:05:17] <_joe_> cool works as expected [09:05:26] (03CR) 10Jcrespo: [C: 032] mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:05:36] Notice: /Stage[main]/Profile::Etcd::Replication/Etcdmirror::Instance[/conftool@eqiad.wmnet]/Systemd::Service[etcdmirror-conftool-eqiad-wmnet]/Service[etcdmirror-conftool-eqiad-wmnet]/ensure: ensure changed 'running' to 'stopped' [09:05:49] <_joe_> !log restarting confd on all nodes in eqiad and esams [09:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:08] !log stop etcdmirror replication on conf2002 [09:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:55] (03PS1) 10Volans: Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 [09:07:09] (03Merged) 10jenkins-bot: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:08:43] (03CR) 10Alexandros Kosiaris: [C: 031] Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:09:06] (03CR) 10Volans: [C: 032] Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:10:23] (03CR) 
10jerkins-bot: [V: 04-1] Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:10:48] bad jenkins... what's wrong [09:11:18] damn again the randomly failing test from a static checker, I need to dig into this [09:11:21] <_joe_> volans: why can't cumin connect to tegmen? [09:11:27] (03CR) 10jenkins-bot: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:11:33] <_joe_> elukey: now can you go to like conf1001 and check what's still connecting there? [09:11:34] _joe_: checking [09:11:52] _joe_: I think because we had a re-occurrence of [09:12:00] T199413 [09:12:01] T199413: Systemd restart loop of timer filled the disk on tegmen - https://phabricator.wikimedia.org/T199413 [09:12:07] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Move some wikis for s3 to s5 (duration: 00m 56s) [09:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:18] _joe_ already doing it [09:12:41] (03PS1) 10Banyek: Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 [09:12:42] _joe_: I'm checking the console [09:12:52] godog: FYI ^^^ (tegmen) [09:13:57] <_joe_> elukey: it should only be pybals [09:14:15] <_joe_> but, since it's also esams (we should've moved it before the switch) [09:14:19] <_joe_> let's start there [09:14:28] <_joe_> I'm merging the puppet changes in a moment [09:14:45] _joe_ yep it seems so, I grepped in netstat and I can see other stuff but not related to etcd [09:15:35] also conf100[2,3] looks good afaics [09:15:36] tegmen pretty unresponsive on console, I'll reboot it [09:16:22] !log rebooting tegmen, console stuck, possible re-occurrence of T199413 (to be confirmed) [09:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:42] volans: uuughhhh, looks like fs was at 30% 
tho [09:16:55] I was looking at https://grafana.wikimedia.org/dashboard/db/host-overview-grafanalib?refresh=300s&orgId=1&panelId=12&fullscreen&var-datasource=codfw%20prometheus%2Fops&var-server=tegmen&var-cluster=misc [09:17:02] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) [09:17:31] godog: maybe was stuck for other reasons, let's see now that comes back ;) [09:17:47] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:19:09] godog: confirmed, disk space is ok, seems unreleated [09:19:13] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) [09:19:40] 10Operations, 10Core Platform Team Kanban (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10mobrovac) [09:19:54] godog: but funnily enough, it's now in the restart loop [09:20:03] (03PS1) 10Giuseppe Lavagetto: pybal: stop reading from the old etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/464779 (https://phabricator.wikimedia.org/T205814) [09:20:15] <_joe_> elukey: ^^ [09:20:21] <_joe_> let's do this quick [09:22:06] <_joe_> ok, merging it [09:22:09] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: stop reading from the old etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/464779 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [09:23:14] ack [09:23:15] <_joe_> now I'll apply puppet on the lvs servers in eqiad and esams [09:23:31] does it require a rolling restart of pybal or just a puppet run? 
[09:23:45] <_joe_> both ofc [09:23:48] ok [09:24:07] <_joe_> we're not foolish enough to let puppet control pybal that much [09:24:18] yep [09:24:30] (03CR) 10Muehlenhoff: [C: 031] "Looks good. Given that our package build host is now also running stretch, you can also simply add it (and the golang one) to the package " [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [09:24:38] s/foolish/crazy/g :P [09:25:23] <_joe_> elukey: can you do the dance in eqiad? [09:25:29] <_joe_> remember, first the secondary pybals [09:25:32] <_joe_> then the primaries [09:26:10] me restarting pybals? I'd prefer not :) [09:26:22] <_joe_> why? [09:26:31] :_( he doesn't trust pybal [09:26:33] * vgutierrez sad [09:26:37] never done it and I don't want to make things exploding [09:26:39] <_joe_> somehow my fingers give magical juice to systemctl? [09:26:40] <_joe_> :P [09:26:44] <_joe_> it's eqiad [09:26:53] <_joe_> you can play in the sand, kid [09:26:55] <_joe_> :P [09:27:04] there is no problem in restart pybals [09:27:13] as long as you don't restart both of a group at the same time [09:27:21] even in that case [09:27:35] well, there is gonna be a spike in 5xx then [09:27:37] <_joe_> that's why I said "first secondaries, then primaries" [09:27:38] as I 've found out [09:27:49] the default static routers would trigger and route the traffic via the primary lvs instances [09:27:49] <_joe_> so the problem right now is [09:27:53] s/routers/routes/g [09:27:57] <_joe_> puppet changes an npre check [09:28:05] <_joe_> that will start alarming if we don't restart pybal [09:28:10] vgutierrez: actually it would cause a BGP flap and traffic going back and forth [09:28:12] <_joe_> which is a nice reminder :) [09:28:28] it would coalesce pretty soon but still [09:29:01] <_joe_> elukey: ok I wil restart pybal on the lvs servers [09:29:53] <_joe_> can someone check lvs1002? 
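[Editorial sketch] The restart order _joe_ insists on (secondary pybals first, then primaries) avoids taking both LVS members of a group down at once, which would otherwise cause the BGP flap and 5xx spike vgutierrez describes below. A dry-run sketch of that ordering; the host groupings here are hypothetical placeholders, not the real eqiad topology:

```shell
# Dry run of the rolling pybal restart order discussed above:
# secondary LVS hosts first, then primaries, never both members
# of a group at the same time. Host names are illustrative only.
secondaries="lvs1006 lvs1016"   # hypothetical secondary pybals
primaries="lvs1001 lvs1003"     # hypothetical primary pybals

for h in $secondaries $primaries; do
  # Print instead of executing: this is a dry run, not the real procedure.
  echo "would run: ssh $h sudo systemctl restart pybal"
done
```

Puppet deliberately does not restart pybal itself ("we're not foolish enough to let puppet control pybal that much"), so both the puppet run and the manual restart are needed.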
[09:30:04] thanks, it takes me a bit of time to find 1) the lvs 2) understand who is primary/secondary 3) restart [09:30:45] <_joe_> vgutierrez: I see an alert about bgp sessions on lvs3004 [09:31:01] hmmm [09:31:04] * vgutierrez checking [09:31:33] <_joe_> just webnt away [09:31:34] <_joe_> sorry [09:31:50] yup.. seems happy :) [09:32:00] don't worry [09:32:04] <_joe_> ok so lvs1002 seems to be farting since 15 hours [09:32:13] <_joe_> how ocme I didn't notice it? [09:32:42] <_joe_> Puppet is disabled. bblack [09:32:51] <_joe_> nice, informative comment brandon :P [09:33:35] from the logs it might be due to the network maintenance that we did yesterday [09:33:49] <_joe_> pybal is dead too there, so there isn't much to worry about [09:36:14] !log adding wmf-pt-kill_2.2.20-1+wmf2 package for stretch [09:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:38] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:36:38] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Systemd restart loop of timer filled the disk on tegmen - https://phabricator.wikimedia.org/T199413 (10Volans) Sorry, false alarm, it was unrelated. 
[09:37:11] !log disabling puppet on labsdb1009,labsdb1010,labsdb1011 (T203674) [09:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:15] T203674: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 [09:37:17] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:37:30] this is me --^ [09:37:48] !log restart rsyslog on lithium - broken connection to tegmen - T199406 [09:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:52] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 [09:38:07] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1114 days) [09:38:28] <_joe_> elukey: can you confirm there are no more external connections to 1003? [09:38:38] checking [09:38:45] godog: Cc for rsyslog --^ [09:39:11] (03CR) 10Volans: [C: 032] "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:39:30] _joe_ I can see established to lvs3002.esams.wmn [09:39:33] thanks elukey! [09:40:00] <_joe_> elukey: uhm [09:40:08] tcp 0 0 conf1003.eqiad.wmn:2379 lvs3002.esams.wmn:58102 ESTABLISHED - [09:40:23] <_joe_> can you look which applicaiton is doing that? 
[09:40:27] <_joe_> that's pretty strange [09:40:37] (03Merged) 10jenkins-bot: Upstream release v0.0.9 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464773 (owner: 10Volans) [09:41:03] nginx 31412 www-data 20u IPv4 1040953863 0t0 TCP conf1003.eqiad.wmnet:2379->lvs3002.esams.wmnet:58096 (ESTABLISHED) [09:41:12] there you go :) [09:41:20] <_joe_> no I mean on lvs3002 :P [09:41:42] <_joe_> I knew it was nginx on the conf1003 side [09:42:30] pybal [09:42:47] PROBLEM - Device not healthy -SMART- on bast4001 is CRITICAL: cluster=misc device=sdc instance=bast4001:9100 job=node site=ulsfo https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast4001&var-datasource=ulsfo%2520prometheus%252Fops [09:43:00] <_joe_> meh [09:43:48] PROBLEM - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=4) [09:44:37] there are two crits for 1006 and 1016 as well [09:44:43] <_joe_> yes, expected [09:44:57] ack [09:45:08] <_joe_> 1006 should recover soon [09:45:27] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:46:38] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 42 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:47:51] <_joe_> labpuppetmaster1001.wikimedia.org is still connected to conf1001 [09:48:57] RECOVERY - PyBal connections to etcd on lvs1001 is OK: OK: 4 connections established with conf2001.codfw.wmnet:2379 (min=4) [09:49:56] I can see webperf1001.eqiad also on conf1003 [09:50:28] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 42 connections established with conf2001.codfw.wmnet:2379 (min=42) [09:50:33] /srv/deployment/performance/navtiming/run_navtiming.py [09:52:04] <_joe_> elukey: wat [09:52:28] <_joe_> what is that reading? 
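[Editorial sketch] Chasing down the remaining long-lived clients, as elukey does here, is mostly a matter of reading `netstat`/`lsof` output on the old conf hosts for established connections on the etcd client port (2379). A small sketch that extracts the remote peer from the netstat line quoted verbatim in the log (the `awk` invocation is an assumption, not the command actually used):

```shell
# Parse a captured netstat line (quoted verbatim in the log, hostname
# truncation included) and print the remote endpoint still connected
# to conf1003's etcd client port (2379).
line='tcp 0 0 conf1003.eqiad.wmn:2379 lvs3002.esams.wmn:58102 ESTABLISHED -'

# Field 5 is the foreign address host:port; keep only the host part.
peer=$(printf '%s\n' "$line" |
  awk '$6 == "ESTABLISHED" { split($5, p, ":"); print p[1] }')

echo "$peer"   # the client that still needs a confd/pybal restart
```

On the conf host itself, `lsof -i :2379` gives the same picture from the server side, as the nginx line a few messages later shows.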
[09:52:32] <_joe_> it's not documented [09:52:59] I am trying to understand [09:53:35] ah yes it is in /srv/deployment/performance/navtiming/config.ini [09:53:47] /conftool/v1/mediawiki-config/common/WMFMasterDatacenter [09:53:58] <_joe_> ok [09:54:10] <_joe_> how can we change the configuration? [09:54:37] lemme check the navtiming puppet config [09:54:46] or possibly the repo [09:54:51] <_joe_> also, if it used python-etcd, a simple restart would suffice [09:56:05] <_joe_> I'll submit a patch in case [09:57:45] yes python-etcd is deployed [10:00:10] (03PS2) 10Jcrespo: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) [10:03:17] <_joe_> ok navtiming uses python-etcd AFAICT [10:03:50] <_joe_> !log restarting navtiming.service on webperf1001 to pick up the dns change for etcd [10:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:29] (03PS1) 10Banyek: wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) [10:04:51] <_joe_> elukey: it's now connected to 2002 [10:05:02] _joe_ yep I was about to say that, I just finished reading the code [10:05:05] super [10:05:15] <_joe_> elukey: it uses python-etcd or conftool? 
[10:05:20] <_joe_> the latter would be better [10:05:57] https://github.com/wikimedia/performance-navtiming/blob/master/navtiming/__init__.py#L46 [10:06:09] <_joe_> heh [10:06:16] <_joe_> ok, I will send a patch then [10:07:22] <_joe_> elukey: now if you want to be double sure, check the grafana data for etcd in codfw and some of the logs [10:07:35] so grepping for 2379 I can only see labspuppetmaster still connected to conf1001 [10:07:53] <_joe_> yeah, we need to understand that too [10:08:05] <_joe_> I thought it was confd, but that should be restarted already [10:09:23] it is confd indeed [10:09:33] confd 974 root 3u IPv4 446566539 0t0 TCP labpuppetmaster1001.wikimedia.org:54070->conf1001.eqiad.wmnet:2379 (ESTABLISHED) [10:09:41] <_joe_> ok, just restart it :) [10:10:05] (03PS1) 10Volans: Add missing build dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/464783 [10:10:11] connected to conf2001 now :) [10:10:15] <_joe_> cool [10:10:25] (03Abandoned) 10Volans: Add missing build dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/464783 (owner: 10Volans) [10:10:35] !log restart confd on labs-puppetmaster to pick up new etcd settings (eqiad -> codfw) [10:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:52] <_joe_> elukey: let's take a break, then we can play with some hosts in eqiad and connect them to the new cluster [10:11:23] <_joe_> well done! 
[10:11:26] (03PS1) 10Volans: Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 [10:11:30] well you did all the work :D [10:11:41] glad that we are close to nuke conf100[1,3] :) [10:13:27] (03CR) 10jerkins-bot: [V: 04-1] Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:13:49] (03CR) 10Volans: "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:13:58] RECOVERY - MegaRAID on helium is OK: OK: optimal, 1 logical, 12 physical [10:15:53] (03CR) 10jerkins-bot: [V: 04-1] Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:16:24] (03CR) 10Volans: [C: 032] "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:17:14] !log rearmed keyholder on netmon2001 [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:28] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. [10:17:53] (03Merged) 10jenkins-bot: Add missing build dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/464784 (owner: 10Volans) [10:21:25] 10Operations, 10ops-ulsfo, 10Traffic: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592 (10MoritzMuehlenhoff) Note that this host also emits SMART errors since two days, not worth investigating further as it's going to be decommed. 
[10:21:42] ACKNOWLEDGEMENT - Device not healthy -SMART- on bast4001 is CRITICAL: cluster=misc device=sdc instance=bast4001:9100 job=node site=ulsfo Muehlenhoff T178592 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast4001&var-datasource=ulsfo%2520prometheus%252Fops [10:26:48] 10Operations, 10Traffic: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) p:05Triage>03Normal [10:34:32] 10Operations, 10Traffic, 10vm-requests: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Krenair) [10:35:43] (03PS2) 10Banyek: wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) [10:36:33] 10Operations, 10Traffic, 10vm-requests: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) a:03Vgutierrez [10:37:22] (03PS3) 10Banyek: wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) [10:37:54] !log uploaded spicerack_0.0.9-1{,+deb9u1} to apt.wikimedia.org {jessie,stretch}-wikimedia - T199079 [10:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:59] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [10:37:59] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) >>! In T203786#4629141, @elukey wrote:= > The throttling mechanism that memca... 
[10:39:20] (03CR) 10Banyek: [C: 032] wikirepicas: wmf-pt-kill template data only from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464781 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [10:40:19] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) [10:40:59] !log restarting replication on labsdb1010/1 on s3 and s5 [10:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:36] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:46:08] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:50:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:51:01] * banyek|away lunch [10:56:05] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [11:02:32] (03PS1) 10Vgutierrez: dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) [11:09:32] (03PS2) 10Vgutierrez: dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) [11:12:08] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:14:18] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: 
OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:17:20] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) So the wikis have been loaded into s5, and they are the primary place to read them (and eventually, write them), the only think pending is, some... [11:17:40] (03PS1) 10Jcrespo: site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - 10https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) [11:22:21] (03CR) 10Alex Monk: [C: 031] dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) (owner: 10Vgutierrez) [11:24:27] (03CR) 10Vgutierrez: [C: 032] dns entries for certcentral[12]001 [dns] - 10https://gerrit.wikimedia.org/r/464795 (https://phabricator.wikimedia.org/T206308) (owner: 10Vgutierrez) [11:25:29] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) This has to be done https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replica_DNS **after **the dblists are updated (without an... 
[11:29:14] !log rebooting ruthenium for kernel security update [11:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:48] PROBLEM - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [11:30:51] ACKNOWLEDGEMENT - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206313 [11:30:56] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10ops-monitoring-bot) [11:31:14] (03PS1) 10Elukey: Add matomo1001's IPv6 PTR [dns] - 10https://gerrit.wikimedia.org/r/464799 (https://phabricator.wikimedia.org/T202962) [11:31:49] (03CR) 10Elukey: [C: 032] Add matomo1001's IPv6 PTR [dns] - 10https://gerrit.wikimedia.org/r/464799 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [11:33:02] 10Operations, 10Scap, 10Services, 10Discovery-Search (Current work), 10Release: Modify scap::target to define sudo rules for multiple services - https://phabricator.wikimedia.org/T206314 (10Mathew.onipe) [11:33:09] 10Operations, 10Scap, 10Services, 10Discovery-Search (Current work), 10Release: Modify scap::target to define sudo rules for multiple services - https://phabricator.wikimedia.org/T206314 (10Mathew.onipe) a:03Mathew.onipe [11:33:40] (03PS6) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [11:34:03] (03CR) 10jerkins-bot: [V: 04-1] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [11:37:27] 10Operations, 10Analytics, 10Analytics-Kanban: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) p:05Triage>03Normal [11:42:10] (03CR) 10Arturo Borrero Gonzalez: [C: 031] Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:42:48] (03CR) 10Arturo Borrero 
Gonzalez: [C: 031] Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:43:10] !log rebooting wezen for kernel security update [11:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:27] (03PS1) 10Elukey: Decommission bohrium [puppet] - 10https://gerrit.wikimedia.org/r/464802 (https://phabricator.wikimedia.org/T206315) [11:44:45] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10aborrero) Could you please also patch `profile::prometheus::openstack_exporter`? [11:46:16] (03CR) 10Elukey: [C: 032] Decommission bohrium [puppet] - 10https://gerrit.wikimedia.org/r/464802 (https://phabricator.wikimedia.org/T206315) (owner: 10Elukey) [11:48:53] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff) >>! In T135991#4645190, @aborrero wrote: > Could you please also patch `profile::prometheus::openstack_exporter`? I actually looked into that earl... [11:49:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10ops-monitoring-bot) wmf-decommission-host was executed by elukey for bohrium.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from Pu... [11:50:40] volans: just used wmf-decommission-host - is it working with ganeti instances? 
[11:50:50] because at some point it asks for the mgmt interface [11:53:50] 10Operations, 10Wikidata, 10Wikidata-Query-Service: wdqs1009 - cannot create /var/log/wdqs/wdqs_autodeployment.log - https://phabricator.wikimedia.org/T206318 (10Krenair) [11:53:53] the output in the task is awesome btw [11:57:18] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10aborrero) Do you mean writing to a file somewhere all the env vars produced by `/root/novaenv.sh` and then loading them in `EnvironmentFile=`? I... [11:59:42] !log deleted bohrium from ganeti via gnt-instance [11:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:39] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) ``` elukey@ganeti1001:~$ sudo gnt-instance remove bohrium.eqiad.wmnet This will remove the volumes of the instance bohrium.eqiad.wmnet (including mirrors), thus rem... [12:00:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) [12:02:08] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Decommission bohrium - https://phabricator.wikimedia.org/T206315 (10elukey) 05Open>03Resolved [12:05:00] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) We have a mechanism called `dmz_cidr` which we can use to exclude NATs between certain IP ranges. See a more detailed explanation h... 
[12:06:41] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10jcrespo) p:05Triage>03Normal a:03Cmjohnson m3 master, hopefully you will have still a couple of 600GB disks for this and for T206254 [12:09:56] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) Not surprising after a reboot :( [12:12:03] !log Creating certcentral2001.codfw.wmnet in ganeti - T206308 [12:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:07] T206308: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 [12:13:41] !log Creating certcentral1001.eqiad.wmnet in ganeti - T206308 [12:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:16] PROBLEM - cpjobqueue endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:16:26] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retr [12:16:26] data for April 29, 2016) timed out before a response was received: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [12:16:27] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:16:36] PROBLEM - eventstreams on scb2001 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [12:16:48] PROBLEM - Check size of conntrack table on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:16:56] PROBLEM - SSH on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:57] PROBLEM - Disk space on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:06] PROBLEM - apertium apy on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:06] PROBLEM - pdfrender on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:17] PROBLEM - Check whether ferm is active by checking the default input chain on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:26] PROBLEM - nutcracker process on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:27] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [12:17:47] PROBLEM - dhclient process on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:17:56] PROBLEM - configured eth on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:18:26] can't connect via ssh --^ [12:18:26] RECOVERY - nutcracker process on scb2001 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [12:18:36] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 0.076 second response time [12:18:42] ah yes now I can, oom killer party [12:18:46] RECOVERY - dhclient process on scb2001 is OK: PROCS OK: 0 processes with command name dhclient [12:18:47] RECOVERY - Check size of conntrack table on scb2001 is OK: OK: nf_conntrack is 6 % full [12:18:47] RECOVERY - SSH on scb2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [12:18:47] RECOVERY - configured eth on scb2001 is OK: OK - interfaces up [12:18:50] 10Operations, 10cloud-services-team: Use of wrapper script in prometheus-openstack-exporter prevents automated restarts - https://phabricator.wikimedia.org/T206304 (10MoritzMuehlenhoff) >>! In T206304#4645232, @aborrero wrote: > Do you mean writing to a file somewhere all the env vars produced by `/root/novaen... [12:18:57] RECOVERY - Disk space on scb2001 is OK: DISK OK [12:18:57] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.073 second response time [12:18:57] RECOVERY - apertium apy on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time [12:19:14] mobrovac: --^ [12:19:15] pdfrender? [12:19:16] RECOVERY - cpjobqueue endpoints health on scb2001 is OK: All endpoints are healthy [12:19:17] RECOVERY - Check whether ferm is active by checking the default input chain on scb2001 is OK: OK ferm input default policy is set [12:19:17] what's the load avg elukey? I can see it skyrocketed up [12:19:27] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [12:19:40] sigh [12:19:48] elukey@scb2001:~$ uptime 12:19:42 up 220 days, 22:01, 1 user, load average: 1780.46, 3441.59, 1778.26 [12:19:51] disk space? what happened? 
[12:20:00] ah no, just all critical [12:20:06] (03PS1) 10Muehlenhoff: Further WMCS Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/464805 [12:20:09] mobrovac: a nodejs app was killed by the oom [12:20:53] elukey: which one? [12:21:43] well I can only see [12:21:43] [Fri Oct 5 12:36:43 2018] Out of memory: Kill process 7903 (nodejs) score 24 or sacrifice child [12:21:46] [Fri Oct 5 12:36:43 2018] Killed process 7903 (nodejs) total-vm:10924168kB, anon-rss:810452kB, file-rss:0kB, shmem-rss:0kB [12:22:01] (03PS3) 10Gehel: WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [12:22:40] (03PS1) 10Vgutierrez: install_server: Add certcentral[12]001 to the DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/464806 (https://phabricator.wikimedia.org/T206308) [12:23:00] (03CR) 10Gehel: [C: 032] WDQS: Add sudo rules for wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/464767 (https://phabricator.wikimedia.org/T206303) (owner: 10Mobrovac) [12:23:17] hm [12:24:51] 10Operations, 10Traffic, 10vm-requests, 10Patch-For-Review: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) certcentral1001 created with the following cmd: ``` sudo gnt-instance add -t drbd -I hail --net 0:link=private --hypervisor-parameters=kvm:boot_order=netwo... 
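The oom-killer diagnosis above boils down to spotting the kernel's "Killed process" line in dmesg and reading off the victim. A minimal Python sketch of that parsing step, assuming the stock Linux oom-killer message format (the exact line quoted from scb2001 above is used as the sample):

```python
import re

# The kernel oom-killer line quoted from scb2001's dmesg above.
LINE = ("[Fri Oct  5 12:36:43 2018] Killed process 7903 (nodejs) "
        "total-vm:10924168kB, anon-rss:810452kB, file-rss:0kB, shmem-rss:0kB")

def parse_oom_kill(line):
    """Extract pid, command name and anonymous RSS (kB) from a
    'Killed process' dmesg line; return None if the line doesn't match."""
    m = re.search(r"Killed process (\d+) \((\S+)\).*?anon-rss:(\d+)kB", line)
    if not m:
        return None
    pid, name, rss = m.groups()
    return {"pid": int(pid), "name": name, "anon_rss_kb": int(rss)}

print(parse_oom_kill(LINE))
# → {'pid': 7903, 'name': 'nodejs', 'anon_rss_kb': 810452}
```

Note the caveat raised below about the bracketed timestamp: `dmesg -T` reconstructs wall-clock time from the boot time plus the monotonic counter, so it can drift after suspend or clock adjustments and is best treated as approximate.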
[12:25:22] https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&orgId=1&from=now-1h&to=now&var-stream=All&var-topic=All&var-scb_host=scb2001 looks suspicious [12:25:42] (03PS2) 10Muehlenhoff: Further WMCS Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/464805 [12:25:57] elukey: the time stamp of the log is weird, it's in the future [12:26:47] ah yes it is dmesg -T, might not be accurate IIRC [12:27:10] some ES worker died recently mobrovac [12:27:58] (03PS1) 10Mathew.onipe: scap::target: added services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) [12:29:59] hm i can't see anything suspicious in the logs [12:30:02] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:30:41] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:30:57] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:31:07] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) [12:33:49] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10fgiunchedi) >>! In T179050#4644666, @fgiunchedi wrote: > Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Pr... [12:35:45] (03CR) 10Alexandros Kosiaris: "This looks substantially more code for a function that arguably envsubst seems to handle fine. Which platforms miss that tool ? 
On Debian " [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 (owner: 10Dduvall) [12:36:54] (03CR) 10Muehlenhoff: [C: 032] Further WMCS Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/464805 (owner: 10Muehlenhoff) [12:54:48] (03CR) 10Alexandros Kosiaris: [C: 031] Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [12:58:12] elukey: true, it's not explicitly supported, but should work, you can put a wrong host as mgmt and it will catch the error and just print a message but should continue [12:58:19] (03CR) 10Gehel: [C: 04-1] "See comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [12:58:22] if you could though please open a task for it [12:58:34] (03CR) 10Alexandros Kosiaris: [C: 031] Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [12:58:41] (03PS1) 10Elukey: Remove explicit hiera calls from hive/oozie mysql classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 [13:02:22] (03PS1) 10Banyek: wmf-pt-kill: bugfix for config file [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 [13:02:30] !log upgraded spicerack to version 0.0.9 on sarin/neodymium/cumin* - T199079 [13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:35] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [13:03:59] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10Volans) [13:07:20] (03CR) 10Banyek: [C: 032] Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [13:09:40] (03CR) 10Muehlenhoff: "For context; do you want to ship a default in the deb or not? 
It's also fine to only ship one in .examples and fully rely on puppet to cre" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 (owner: 10Banyek) [13:12:28] (03CR) 10Banyek: [C: 032] "> For context; do you want to ship a default in the deb or not? It's" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 (owner: 10Banyek) [13:12:36] (03CR) 10Banyek: [V: 032 C: 032] wmf-pt-kill: bugfix for config file [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/464810 (owner: 10Banyek) [13:13:03] (03PS2) 10Elukey: Remove explicit hiera calls from hive/oozie mysql classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 [13:15:13] (03PS2) 10Banyek: Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 [13:15:16] (03CR) 10Banyek: [V: 032 C: 032] Builder: add dh-sysuser for builders > stretch [puppet] - 10https://gerrit.wikimedia.org/r/464775 (owner: 10Banyek) [13:17:14] (03PS1) 10Elukey: Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 [13:24:38] (03CR) 10Ema: [C: 031] Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:25:52] !log installing python3.5/2.7 security updates [13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:03] !log adding wmf-pt-kill_2.2.20-1+wmf3 package for stretch [13:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:56] (03CR) 10Ema: [C: 031] "Two minor comments, LGTM otherwise. Thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:30:24] (03CR) 10Ottomata: [C: 031] "This is good. 
I think the motivation for this was to be able to run the mysql server separately from the oozie/hive servers, but we can d" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 (owner: 10Elukey) [13:31:13] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [13:31:31] (03CR) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:32:31] (03PS1) 10Volans: sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 [13:35:07] (03PS1) 10Herron: Revert "lists: deny subscriptions from blocklisted IP addresses" [puppet] - 10https://gerrit.wikimedia.org/r/464819 [13:35:22] (03PS2) 10Herron: Revert "lists: deny subscriptions from blocklisted IP addresses" [puppet] - 10https://gerrit.wikimedia.org/r/464819 [13:36:51] (03CR) 10Herron: [C: 032] Revert "lists: deny subscriptions from blocklisted IP addresses" [puppet] - 10https://gerrit.wikimedia.org/r/464819 (owner: 10Herron) [13:42:05] (03CR) 10Jcrespo: [C: 031] sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 (owner: 10Volans) [13:45:06] (03PS1) 10Banyek: wikireplicas: enable wmf-pt-kill service [puppet] - 10https://gerrit.wikimedia.org/r/464821 [13:45:58] (03CR) 10Marostegui: "will you make sure to test it on a host with puppet disabled on the others?" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (owner: 10Banyek) [13:46:28] I see that! 
[13:46:32] (03CR) 10Marostegui: "Add the task number to the patch" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (owner: 10Banyek) [13:47:27] (03PS2) 10Banyek: wikireplicas: enable wmf-pt-kill service [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) [13:48:52] (03CR) 10Banyek: "- disabling puppet on hosts" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [13:50:20] (03CR) 10Marostegui: [C: 031] "> - disabling puppet on hosts" [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [13:51:51] (03CR) 10Herron: [C: 032] "reverted this as subscription spam has subsided and we're seeing some false positives" [puppet] - 10https://gerrit.wikimedia.org/r/464819 (owner: 10Herron) [13:55:18] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 (owner: 10Volans) [13:56:25] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: fix tendril host selection [cookbooks] - 10https://gerrit.wikimedia.org/r/464818 (owner: 10Volans) [14:01:55] (03PS1) 10Muehlenhoff: Remove Diamond from openldap/labs hosts [puppet] - 10https://gerrit.wikimedia.org/r/464822 (https://phabricator.wikimedia.org/T183454) [14:07:48] (03CR) 10Ema: [C: 04-1] "Comments inline. Also, we might want to limit this to eqiad/codfw, where varnish-be communicate directly with the applayer. 
I'm not sure e" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [14:12:57] 10Operations, 10monitoring: Setup metrics monitoring for OpenLDAP/corp - https://phabricator.wikimedia.org/T206327 (10MoritzMuehlenhoff) [14:16:15] (03PS5) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [14:16:26] (03CR) 10Gilles: Backend-Timing Varnish mtail program (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [14:16:54] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Gilles) 05stalled>03Open [14:17:08] (03CR) 10Jcrespo: "This is scheduled for some hours before the switch dc, so no maintenance script using these can be misslead. Labsdb dns, which uses these," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [14:21:18] (03PS6) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [14:23:37] (03CR) 10Marostegui: [C: 031] mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [14:27:29] (03PS4) 10Mark Bergsma: Modernize and cleanup Coordinator [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 [14:27:31] (03PS5) 10Mark Bergsma: Extend testConfigServerRemoval test case. 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/447770 [14:27:33] (03PS3) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [14:28:11] (03CR) 10Vgutierrez: Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:28:50] vgutierrez: ^ :) [14:29:13] <3 [14:30:09] (03CR) 10Elukey: [C: 032] Remove explicit hiera calls from hive/oozie mysql classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/464808 (owner: 10Elukey) [14:33:47] (03PS1) 10Elukey: Add new passwords for oozie/hive to role::analytics_cluster::coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/464825 [14:33:55] (03PS4) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [14:34:45] (03CR) 10Elukey: [V: 032 C: 032] Add new passwords for oozie/hive to role::analytics_cluster::coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/464825 (owner: 10Elukey) [14:34:48] (03PS2) 10Elukey: Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 [14:35:47] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10herron) >>! In T206261#4645246, @aborrero wrote: > We have a mechanisms called `dmz_cidr` which we can use to exclude NATs between certain IP... 
[14:36:45] (03CR) 10Elukey: "Looks good: https://puppet-compiler.wmflabs.org/compiler1002/12792/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/464816 (owner: 10Elukey) [14:36:52] (03PS3) 10Elukey: Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 [14:38:00] (03CR) 10Elukey: [C: 032] Retrieve hive/oozie database configurations from hiera [puppet] - 10https://gerrit.wikimedia.org/r/464816 (owner: 10Elukey) [14:46:40] (03PS5) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [14:47:06] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:52:28] Hello everyone, per discussion in https://phabricator.wikimedia.org/T163274 the CI job that runs tests for the JsonConfig extension needs some changes, but I don't know how to write a patch for that or even if that's something I am able to do. How can I proceed? 
[14:53:01] Sorry if that's not the right channel to ask that, I just couldn't find more information on mediawiki and wikitech pages related to jenkins [14:55:45] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:56:27] (03PS2) 10Banyek: wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:56:34] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:56:54] !log depooling labsdb1011 [14:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:07] !log depooling labsdb1011 (T195747) [14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:11] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [15:03:37] 10Operations, 10Performance-Team, 10HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786 (10BPirkle) [15:05:37] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10elukey) ping :) [15:06:35] AndyRussG: o/ - not sure if you saw T203669, can you comment next week whenever you have time? So we can plan it accordingly :) [15:06:36] T203669: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 [15:07:39] elukey: hi! 
aaaaargh i did see it, many apologies for the delay in replying [15:08:10] (03CR) 10Cwhite: [C: 031] Remove Diamond from openldap/labs hosts [puppet] - 10https://gerrit.wikimedia.org/r/464822 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:09:08] elukey: the plan had been to start using the EventLogging stream at 100% client-side sample rate (on Fundraising banner campaigns) in time for the end-of-the year campaigns [15:09:14] (03CR) 10Cwhite: [C: 031] Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:09:16] however other stuff has come up and I'm not sure we'll get there [15:09:50] The EL stream is currently turned on globally at 1% client-side sample rate, so it should be currently usable to build the realtime feature [15:10:23] What I don't know is how much Fundraising online really needs this particular stream of realtime data [15:11:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:11:14] I'll comment on the task in a little while and ask dstrine (David, our PM) if he has more input during standup [15:11:20] thanks a lot :) [15:11:32] elukey: likewise thanks much and apologies again for the delay!!!! :D [15:11:57] AndyRussG: np! 
I was reviewing the pending tasks and annoying people with pings :) [15:15:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:17:42] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [15:18:41] (03PS4) 10Cwhite: openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) [15:19:34] (03PS1) 10Elukey: profile::prometheus::alerts: tune druid alerts [puppet] - 10https://gerrit.wikimedia.org/r/464828 [15:20:29] elukey: heheh pings much appreciated :) [15:20:36] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: tune druid alerts [puppet] - 10https://gerrit.wikimedia.org/r/464828 (owner: 10Elukey) [15:20:43] (03PS2) 10Elukey: profile::prometheus::alerts: tune druid alerts [puppet] - 10https://gerrit.wikimedia.org/r/464828 [15:24:45] (03PS1) 10Cwhite: ntp: remove diamond::collector in favor of prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) [15:26:07] (03CR) 10Cwhite: [C: 032] ntp: remove diamond::collector in favor of prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:27:14] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:33] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:33:25] (03CR) 10Muehlenhoff: "That looks wrong: These servers use ntp in server mode, i.e. 
acting as time servers, I don't think we currently collect the relevant metri" [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:34:44] (03CR) 10Ayounsi: [C: 032] Fix PTR for cr3-ulsfo<-->cr4-ulsfo link [dns] - 10https://gerrit.wikimedia.org/r/464068 (owner: 10Ayounsi) [15:34:48] (03PS2) 10Ayounsi: Fix PTR for cr3-ulsfo<-->cr4-ulsfo link [dns] - 10https://gerrit.wikimedia.org/r/464068 [15:35:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:38:53] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) It looks like there was another little flood on the 1st of October with requests being banned again: https://logstash.... 
[15:41:32] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:46:33] 10Operations: Allow directing users to PHP7 based on a cookie - https://phabricator.wikimedia.org/T206338 (10Joe) [15:46:55] 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10Joe) [15:52:22] (03PS1) 10Herron: admin: add turnilo and superset sudo privs to analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) [15:53:32] PROBLEM - Device not healthy -SMART- on db1072 is CRITICAL: cluster=mysql device=megaraid,10 instance=db1072:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops [15:54:12] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10herron) Is https://gerrit.wikimedia.org/r/464831 what you had in mind in terms of sudo privs? 
[15:55:47] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Resolve elasticsearch shard size alert by doing an in place reindex - https://phabricator.wikimedia.org/T204362 (10debt) 05Open>03Resolved [15:55:51] (03PS2) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [15:56:41] (03CR) 10jerkins-bot: [V: 04-1] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [15:56:51] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [15:57:33] (03PS1) 10Mforns: Add druid_load.pp to refinery jobs [puppet] - 10https://gerrit.wikimedia.org/r/464833 [15:58:10] (03PS3) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [15:58:19] (03CR) 10jerkins-bot: [V: 04-1] Add druid_load.pp to refinery jobs [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [15:58:36] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750 (10herron) 05Open>03Resolved The RBL check that was causing 403s for subscription attempts from IPs listed on spam blacklists was reve... [16:00:15] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10herron) 05Open>03Resolved a:03herron The rbl check that was causing the 403 error when attempting to subscribe from an IP in a spam blocklist was reverted this... 
[16:03:40] 10Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10Joe) [16:04:35] (03CR) 10Ottomata: [C: 031] admin: add turnilo and superset sudo privs to analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: 10Herron) [16:06:04] (03PS1) 10Elukey: Decommission bohrium [dns] - 10https://gerrit.wikimedia.org/r/464836 (https://phabricator.wikimedia.org/T206315) [16:06:09] (03CR) 10Mforns: [C: 04-1] "Please, do not merge yet! Just for reference thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [16:06:25] (03CR) 10Ottomata: Add druid_load.pp to refinery jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [16:06:27] (03CR) 10Cwhite: [C: 032] "> That looks wrong: These servers use ntp in server mode, i.e. acting" [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:06:29] (03CR) 10Elukey: [C: 031] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: 10Herron) [16:07:42] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Raise alert level on disk space for old elasticsearch servers - https://phabricator.wikimedia.org/T204361 (10debt) 05Open>03Resolved [16:09:32] PROBLEM - High lag on wdqs1010 is CRITICAL: 1.719e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:09:57] 10Operations: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) [16:10:42] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 37 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:15:27] (03PS1) 10Bstorm: labstore: we really only want to know about prolonged load issues [puppet] - 10https://gerrit.wikimedia.org/r/464838 (https://phabricator.wikimedia.org/T206144) [16:16:01] PROBLEM - High lag on wdqs1010 is CRITICAL: 1.382e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:16:43] downtime expired on wdqs1010, I'll add some... 
[16:22:32] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 30 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:28:38] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Cmjohnson) @akosiaris I found a spare 4TB SAS disk...replacing it now [16:31:01] (03CR) 10Elukey: [C: 032] Decommission bohrium [dns] - 10https://gerrit.wikimedia.org/r/464836 (https://phabricator.wikimedia.org/T206315) (owner: 10Elukey) [16:32:00] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Cmjohnson) Failed disk has been swapped out [16:34:27] (03PS1) 10Cwhite: hiera: enable ntp collector on labcontrol to replace ntpd diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/464843 (https://phabricator.wikimedia.org/T183454) [16:35:26] (03CR) 10GTirloni: [C: 032] labstore: we really only want to know about prolonged load issues [puppet] - 10https://gerrit.wikimedia.org/r/464838 (https://phabricator.wikimedia.org/T206144) (owner: 10Bstorm) [16:35:32] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Cmjohnson) Failed disk has been swapped out [16:35:42] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) >>! In T203674#4645494, @gerritbot wrote: > Change 464821 had a related patch set uploaded (by Banyek; owner: Banyek): > [operations/pu... 
[16:37:57] (03CR) 10Ottomata: Add druid_load.pp to refinery jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [16:38:26] 10Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10BBlack) Strawperson proposal from IRC, in pseudocode for cache_text, assuming the magic cookie is `f2b31d03ab7`: ``` sub recv_from_client_at_front_edge() { unset req.http.x-use-engine; if req.http.Co... [16:38:49] 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10BBlack) [16:38:56] (03CR) 10Cwhite: [C: 032] "> That looks wrong: These servers use ntp in server mode, i.e. acting" [puppet] - 10https://gerrit.wikimedia.org/r/464829 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:39:57] (03CR) 10Cwhite: "Changes look expected: https://puppet-compiler.wmflabs.org/compiler1002/12793/" [puppet] - 10https://gerrit.wikimedia.org/r/464843 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:40:12] 10Operations, 10ops-eqiad, 10DBA: db1064 has disk smart error - https://phabricator.wikimedia.org/T206245 (10Cmjohnson) Swapped the failed disk [16:42:30] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [16:44:04] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10colewhite) 05Open>03Resolved [16:53:42] RECOVERY - Device not healthy -SMART- on db1072 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops [17:00:42] 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [17:02:03] 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) I've revised the checklist based on chat with @Joe. In particular, we want to start earlier with the verifica... [17:04:52] PROBLEM - MegaRAID on db1064 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:04:54] ACKNOWLEDGEMENT - MegaRAID on db1064 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206345 [17:04:58] 10Operations, 10ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T206345 (10ops-monitoring-bot) [17:06:21] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:07:12] (03PS1) 10Elukey: role::configcluster: set codfw to read/write and eqiad to readonly [puppet] - 10https://gerrit.wikimedia.org/r/464847 (https://phabricator.wikimedia.org/T205814) [17:07:41] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:08:31] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:08:42] <_joe_> XioNoX: ^^ [17:08:49] <_joe_> expected/known? 
[17:09:02] nop, looking [17:09:17] (03CR) 10Giuseppe Lavagetto: [C: 031] role::configcluster: set codfw to read/write and eqiad to readonly [puppet] - 10https://gerrit.wikimedia.org/r/464847 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [17:09:31] (03CR) 10Elukey: [C: 032] role::configcluster: set codfw to read/write and eqiad to readonly [puppet] - 10https://gerrit.wikimedia.org/r/464847 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [17:09:51] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:09:59] Last flapped : 2018-10-05 17:09:11 UTC (00:00:41 ago) [17:10:09] _joe_ merge and run puppet in codfw gently ok? [17:10:46] <_joe_> elukey: not gently [17:10:50] <_joe_> but yes, go on [17:10:55] seems like a (brief) issue with Equinix's OOB link, low priority, but I'll keep an eye on it, thanks [17:11:41] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 4.41 ms [17:12:31] !log set etcd in codfw as read/write (was readonly) and eqiad as readonly (was read/write) [17:12:32] <_joe_> elukey: things are ok now [17:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:42] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10herron) 05Open>03Resolved a:03herron Hi @Shahadat, the requested list has been created and additional details with credentials should have been emailed to y... 
[17:13:42] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.91 ms [17:13:51] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Create a mailling list for Wiki Loves Love - https://phabricator.wikimedia.org/T203792 (10herron) 05Open>03Resolved [17:17:37] _joe_ puppet ran across all the conf* [17:18:16] <_joe_> elukey: I successfully depooled/pooled an appserver, I think we're ok [17:18:53] super [17:23:18] (03PS2) 10Dzahn: icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 [17:23:37] 10Operations, 10Wikimedia-Mailing-lists: Request new mail list for Vietnam Wikimedians User Group - https://phabricator.wikimedia.org/T204974 (10herron) 05Open>03Resolved a:03herron Hi @minhhuy, the requested list has been created and additional details with credentials should have been emailed to you di... [17:26:03] (03CR) 10Dzahn: "wut? compiler says syntax error in naggen.pp ? not even touching this" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [17:26:19] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12794/einsteinium.wikimedia.org/change.einsteinium.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [17:26:42] RECOVERY - Device not healthy -SMART- on db1073 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [17:29:14] (03PS1) 10Dzahn: add blank passwords::network::snmp_ro_community [labs/private] - 10https://gerrit.wikimedia.org/r/464851 [17:29:38] (03PS2) 10Dzahn: add blank passwords::network::snmp_ro_community [labs/private] - 10https://gerrit.wikimedia.org/r/464851 [17:31:43] (03CR) 10Cwhite: [C: 031] "> wut? compiler says syntax error in naggen.pp ? 
not even touching" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [17:32:52] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:33:42] RECOVERY - Device not healthy -SMART- on db1064 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:40:31] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Create mailing list for Bureaucrat of zh.wikipedia - https://phabricator.wikimedia.org/T202435 (10herron) 05Open>03Resolved a:03herron Hi @Wong128hk, the requested list has been created and additional details with credentials should have been em... [17:40:32] (03PS3) 10Dzahn: add blank passwords::network::snmp_ro_community [labs/private] - 10https://gerrit.wikimedia.org/r/464851 [17:40:59] (03CR) 10Dzahn: [V: 032 C: 032] "labs only" [labs/private] - 10https://gerrit.wikimedia.org/r/464851 (owner: 10Dzahn) [17:42:35] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Krinkle) [17:45:11] RECOVERY - MegaRAID on db1064 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:45:39] (03PS1) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) [17:47:35] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 [17:48:59] (03CR) 10Aaron Schulz: [C: 031] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [17:49:48] (03CR) 
10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 (owner: 10Bstorm) [17:50:37] !log repooling labsdb1011 (T195747) [17:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:42] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [17:51:02] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 (owner: 10Bstorm) [17:51:07] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1011 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464855 (owner: 10Bstorm) [17:55:34] (03PS1) 10Andrew Bogott: host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) [17:55:36] (03PS1) 10Andrew Bogott: labvirt/cloudvirt hosts: only page when a host is down [puppet] - 10https://gerrit.wikimedia.org/r/464857 (https://phabricator.wikimedia.org/T206224) [17:56:20] (03PS1) 10Bstorm: wiki replicas: depool labsdb1009 for actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464859 (https://phabricator.wikimedia.org/T195747) [17:56:30] (03CR) 10jerkins-bot: [V: 04-1] host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [17:58:16] !log depooling labsdb1009 (T195747) [17:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:20] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [17:58:46] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1009 for actor table changes to views [puppet] - 
10https://gerrit.wikimedia.org/r/464859 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [18:01:41] (03PS2) 10Andrew Bogott: host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) [18:01:43] (03PS2) 10Andrew Bogott: labvirt/cloudvirt hosts: only page when a host is down [puppet] - 10https://gerrit.wikimedia.org/r/464857 (https://phabricator.wikimedia.org/T206224) [18:02:35] (03CR) 10jerkins-bot: [V: 04-1] host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [18:07:41] (03PS3) 10Andrew Bogott: host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) [18:07:42] !log disabling puppet on icinga for 5 min for extra safety before a change that should be noop [18:07:43] (03PS3) 10Andrew Bogott: labvirt/cloudvirt hosts: only page when a host is down [puppet] - 10https://gerrit.wikimedia.org/r/464857 (https://phabricator.wikimedia.org/T206224) [18:08:08] (03CR) 10Dzahn: [C: 032] icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [18:08:18] (03PS3) 10Dzahn: icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 [18:08:30] !log disabling puppet on icinga for 5 min for extra safety before a change that should be noop [18:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:44] (03CR) 10jerkins-bot: [V: 04-1] host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [18:21:58] (03PS2) 10BBlack: Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 [18:22:11] (03CR)
10jerkins-bot: [V: 04-1] Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [18:23:17] (03PS1) 10BBlack: authdns: add interface::rps and TFO [puppet] - 10https://gerrit.wikimedia.org/r/464861 [18:23:19] (03PS1) 10BBlack: gdnsd config: update for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/464862 [18:23:56] (03CR) 10jerkins-bot: [V: 04-1] authdns: add interface::rps and TFO [puppet] - 10https://gerrit.wikimedia.org/r/464861 (owner: 10BBlack) [18:26:27] !log icinga - noop on all servers, no change, puppet re-enabled, operations normal [18:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:56] (03PS2) 10BBlack: authdns: add interface::rps and TFO [puppet] - 10https://gerrit.wikimedia.org/r/464861 [18:28:58] (03PS2) 10BBlack: gdnsd config: update for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/464862 [18:31:17] !log gdnsd-2.99.9930-beta-1+wmf1 uploaded to stretch-wikimedia [18:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:21] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Smalyshev) > Is there a reason that all mediawiki hosts show as "localhost"? This is probably coming from Jetty, which takes it... 
[18:34:39] (03CR) 10Thcipriani: [C: 032] "Works as expected" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458226 (owner: 10Faidon Liambotis) [18:35:14] (03PS1) 10Cwhite: hiera: comment out diamond::remove [puppet] - 10https://gerrit.wikimedia.org/r/464863 (https://phabricator.wikimedia.org/T183454) [18:35:30] (03Merged) 10jenkins-bot: Don't barf on an empty or invalid YAML config [software/keyholder] - 10https://gerrit.wikimedia.org/r/458226 (owner: 10Faidon Liambotis) [18:35:35] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Smalyshev) Also, 1,182,961 events is a lot. What's going on there? Why so many? Is it a legit scenario? I wonder also if the most... [18:36:37] (03PS1) 10Cwhite: Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 [18:37:03] (03PS2) 10Cwhite: Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 [18:37:05] !log authdns2001: upgraded gdnsd to 2.99.9930-beta [18:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:26] (03CR) 10Muehlenhoff: [C: 031] Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 (owner: 10Cwhite) [18:37:58] (03CR) 10Cwhite: [C: 032] Revert "ntp: remove diamond::collector in favor of prometheus-node-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/464864 (owner: 10Cwhite) [18:38:33] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10jayantanth) I dont understand why my mail ID is not added to admin list. 
[18:38:45] (03CR) 10Cwhite: [C: 032] hiera: comment out diamond::remove [puppet] - 10https://gerrit.wikimedia.org/r/464863 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [18:39:18] (03PS10) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [18:42:25] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [18:51:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:52:57] (03CR) 10Dzahn: [C: 04-1] "it would disable the current crons in codfw, the compiler output says "noop" but because it does the reverse what would happen in prod, be" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [18:54:21] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:54:30] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 [18:58:59] (03CR) 10Dzahn: [C: 032] "yea, none of that supposed issue with naggen.pp happened. 
this was a clean noop as expected, on all production servers" [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [18:59:34] (03PS2) 10Bstorm: labstore: we really only want to know about prolonged load issues [puppet] - 10https://gerrit.wikimedia.org/r/464838 (https://phabricator.wikimedia.org/T206144) [19:02:12] (03PS1) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) [19:04:14] (03PS2) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) [19:06:26] (03CR) 10Dzahn: [C: 031] "yep, planned for Monday morning though, avoiding Friday" [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:10:15] (03PS1) 10Cwhite: Revert "hiera: comment out diamond::remove" [puppet] - 10https://gerrit.wikimedia.org/r/464867 [19:10:48] (03PS2) 10Cwhite: Revert "hiera: comment out diamond::remove" [puppet] - 10https://gerrit.wikimedia.org/r/464867 [19:10:56] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Opened Juniper case 2018-1005-0549 about the ND issue. 
[19:23:42] (03CR) 10Gehel: [C: 04-1] prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [19:30:51] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.002 second response time [19:33:50] (03PS11) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [19:33:58] gehel: ^ known or should we restart blazegraph? (the other day i talked to stas about doing that if wdqs has the issue again) [19:34:26] mutante: there's something wrong with setup on 1009, I am looking into it [19:34:40] SMalyshev: cool:) thanks [19:39:35] (03CR) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [19:42:02] (03CR) 10Smalyshev: [C: 04-1] wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [19:42:40] mutante: ok, I know what the problem is - it got old binaries deployed.
I will fix it [19:44:10] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@f8776de]: Redeploy 1009 [19:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:13] SMalyshev: great :) [19:44:36] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@f8776de]: Redeploy 1009 (duration: 00m 26s) [19:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:02] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.062 second response time [19:45:17] nice [19:46:23] turns out scap git handling is a bit trickier than I thought... so that autodeploy patch probably needs some work. but it should be back to normal now [19:48:10] !log repooling labsdb1009 (T195747) [19:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:15] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [19:48:38] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 (owner: 10Bstorm) [19:49:03] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [19:49:17] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 (owner: 10Bstorm) [19:49:20] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1009 for actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464865 (owner: 10Bstorm) [19:50:43] mutante, SMalyshev: thanks for taking care of that! 
I was mostly off already [19:51:08] (03CR) 10Smalyshev: [C: 031] wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [19:52:19] mutante: you can also ping onimisionipe on wdqs issues, he should have some idea about what's going on :) [19:54:52] (03CR) 10Gehel: [C: 04-1] prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [19:55:01] gehel: oh, of course! yes, the new wdqs-roots group even. will do. i basically picked you randomly [19:55:47] mutante: I'm not complaining (well, almost not)! It's just a good idea to give Matt some more exposure as well :) [19:56:00] *nod* yep [19:57:28] (03CR) 10Dzahn: [C: 032] "compiler output is the opposite of production because active_dc is set to eqiad in labs https://puppet-compiler.wmflabs.org/compiler1002/1" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [19:57:43] (03PS12) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [20:03:26] (03CR) 10Gehel: [C: 04-1] wdqs: auto deployment of wdqs on wdqs1009 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:04:27] (03CR) 10Thcipriani: [C: 032] Drop legacy SSHv1 support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 (owner: 10Faidon Liambotis) [20:05:04] (03CR) 10Smalyshev: [C: 031] wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:05:08] (03CR) 10Smalyshev: wdqs: auto 
deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:05:18] (03Merged) 10jenkins-bot: Drop legacy SSHv1 support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 (owner: 10Faidon Liambotis) [20:07:14] (03CR) 10Dzahn: [C: 032] "double checked on all. no change. before and after this the crons are running on mwmaint2001 and don't exist on either mwmaint1001 or mwma" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [20:08:03] * banyek off [20:09:02] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) This server will become active when we switch Mediawiki back to eqiad on October 10th. [20:10:37] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [20:14:57] (03CR) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [20:19:04] 10Operations, 10Patch-For-Review: releases servers: set rsync direction based on active dc, add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn) 05Open>03Resolved closing this. 2/3 things are done and the remaining one deserves a more broad solution. it is not specific... 
[20:19:48] 10Operations, 10Patch-For-Review: releases servers: add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn)
[20:24:56] (03PS4) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454)
[20:27:42] (03PS18) 10Dzahn: Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[20:33:25] (03CR) 10Dzahn: [C: 032] Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[20:34:16] (03PS1) 10Paladox: Gerrit: Lower gerrit.wmfusercontent.org priority in apache [puppet] - 10https://gerrit.wikimedia.org/r/464877
[20:34:37] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12799/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[20:35:22] (03PS2) 10Paladox: Gerrit: Lower gerrit.wmfusercontent.org priority in apache [puppet] - 10https://gerrit.wikimedia.org/r/464877
[20:36:43] (03PS3) 10Dzahn: Gerrit: Lower gerrit.wmfusercontent.org priority in apache [puppet] - 10https://gerrit.wikimedia.org/r/464877 (owner: 10Paladox)
[20:37:32] (03CR) 10Dzahn: [C: 032] "yep, thanks. per IRC, it would have also been second to load without this but only because "wiki" is before "wmf" in the alphabet. this is" [puppet] - 10https://gerrit.wikimedia.org/r/464877 (owner: 10Paladox)
[20:44:16] !log gerrit - adding gerrit.wmfusercontent.org virtual host for avatars. applied first on gerrit2001, then on cobalt (T191183)
[20:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:22] T191183: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183
[20:48:52] !log T191183 - it's still showing the error page as before but that isn't due to apache issues, it just needs additional ferm rules
[20:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:30] (03PS1) 10Rxy: Remove the "reviewer" group at ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464890 (https://phabricator.wikimedia.org/T205997)
[21:01:48] (03PS1) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454)
[21:02:38] (03Abandoned) 10Cwhite: hiera: enable ntp collector on labcontrol to replace ntpd diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/464843 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[21:10:24] (03PS1) 10Paladox: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907
[21:10:52] (03PS2) 10Paladox: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907
[21:12:02] (03PS3) 10Paladox: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907
[21:15:50] (03CR) 10Dzahn: [C: 031] "[cp1079:~] $ curl -H "Host: gerrit.wmfusercontent.org" http://cobalt.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/464907 (owner: 10Paladox)
[21:20:20] (03CR) 10Dzahn: [C: 031] "yes, the gerrit server has 2 IPs, server (208.80.154.81, cobalt) and service (208.80.154.85, gerrit) on the same interface and 2 Apache vi" [puppet] - 10https://gerrit.wikimedia.org/r/464907 (owner: 10Paladox)
[21:22:31] (03PS3) 10Thcipriani: Drop MD5 (pre-6.8) digest support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458228 (owner: 10Faidon Liambotis)
[21:26:00] (03PS4) 10Dzahn: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[21:27:16] (03PS5) 10Dzahn: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[21:30:25] paladox: i think it's still confusing because we are both making it sound as if Gerrit itself ever was or is behind caches
[21:30:36] paladox: but yea.. it's still the correct title
[21:30:37] (03CR) 10Thcipriani: [C: 032] Drop MD5 (pre-6.8) digest support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458228 (owner: 10Faidon Liambotis)
[21:30:41] yep
[21:30:57] should have called the director "avatars" :p
[21:31:02] lol
[21:31:05] not also gerrit
[21:31:20] (03Merged) 10jenkins-bot: Drop MD5 (pre-6.8) digest support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458228 (owner: 10Faidon Liambotis)
[21:31:37] well, i edited the commit message a bunch to explain
[21:31:42] :)
[21:35:33] (03PS5) 10Dzahn: Gerrit: fix login screen css [puppet] - 10https://gerrit.wikimedia.org/r/464418 (owner: 10Thcipriani)
[21:36:06] (03CR) 10Dzahn: [C: 032] Gerrit: fix login screen css [puppet] - 10https://gerrit.wikimedia.org/r/464418 (owner: 10Thcipriani)
[21:39:01] (03CR) 10Dzahn: [C: 032] "confirmed now the new text is not on the side anymore but below and centered. nice" [puppet] - 10https://gerrit.wikimedia.org/r/464418 (owner: 10Thcipriani)
[21:45:11] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dzahn) Yea, it's not ferm, it's the wrong backend IP per change above.
[21:49:51] (03PS1) 10Cwhite: nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454)
[22:01:06] 10Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) re: partman recipes rdb100[1-4]) echo partman/mw.cfg ;; \ rdb100[5-6]) echo partman/raid1-lvm-ext4-srv-noswap.cfg ;; \ rdb100[7-9]|rdb1010) echo partman/raid1-lvm-ext4-srv.cfg ;; \...
[22:02:33] (03PS1) 10Cwhite: nutcracker: set diamond::remove on all roles containing nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/464918 (https://phabricator.wikimedia.org/T183454)
[22:03:48] (03PS2) 10Cwhite: nutcracker: set diamond::remove on all roles containing nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/464918 (https://phabricator.wikimedia.org/T183454)
[22:04:37] (03PS1) 10Dzahn: partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442)
[22:04:41] (03PS2) 10Cwhite: nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454)
[22:05:42] (03PS2) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454)
[22:09:23] (03CR) 10RobH: [C: 031] partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442) (owner: 10Dzahn)
[22:09:55] (03PS2) 10Dzahn: partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442)
[22:10:31] (03CR) 10Dzahn: [C: 032] partman: let rdb1005/1006 also use recipe with swap [puppet] - 10https://gerrit.wikimedia.org/r/464919 (https://phabricator.wikimedia.org/T140442) (owner: 10Dzahn)
[22:13:14] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562 (10Dzahn)
[22:13:19] 10Operations, 10Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) 05Open>03Resolved >>! In T140442#4637702, @faidon wrote: > As far as this task goes, I'd recommend fixing the partman recipe on Puppet (so that the next install gets it right) and resolvin...
[22:32:39] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Aklapper) I guess because the address was not listed in the task description and the original task author has not seconded the request. Note that any of the curre...
[22:37:05] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[22:37:08] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[22:38:04] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[22:52:56] (03CR) 10BBlack: [C: 032] Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[22:53:05] (03PS6) 10BBlack: Gerrit: Change backend for gerrit in varnish [puppet] - 10https://gerrit.wikimedia.org/r/464907 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[22:56:59] XioNoX: still around? dallas port utilization stuff above ^
[22:57:27] (also, I guess we didn't repool ulsfo yet, probably should've earlier!)
[22:58:21] (03PS1) 10BBlack: Revert "Depool ulsfo for DC move" [dns] - 10https://gerrit.wikimedia.org/r/464922
[22:58:30] bblack thanks! Works https://gerrit.wmfusercontent.org/paladox.png
[22:58:36] cc mutante and thcipriani ^^
[22:59:19] and yeah, I think they are getting default cache TTLs of 86400
[22:59:32] that may be good or bad depending on your POV, as nothing will explicitly purge these on change I doubt :)
[22:59:45] yup
[22:59:58] we are using https://gerrit.wikimedia.org/r/admin/projects/All-Avatars for avatars
[23:01:31] bblack: on my way to a festival, it's possible that Facebook or similar changed something
[23:01:51] last time it was a brief spike
[23:06:23] yeah I'm actually seeing a depression in remote dns caches preferring dallas, too
[23:06:28] I'm checking the IX portal to figure out who that is
[23:06:33] which may indicate they're seeing latency increase in dallas, or loss
[23:07:05] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[23:07:08] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[23:07:11] XioNoX: we're still ok with the basic plan of repooling ulsfo right? because that may alleviate some ulsfo load too
[23:07:20] can we look at the logs to see if there is an increase of requests?
[23:07:22] err, alleviate some dallas load
[23:07:31] yeah, we can, but haven't yet
[23:07:38] yeah, ulsfo is ready to be repooled
[23:07:48] but keep an eye in case
[23:08:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[23:08:17] yes, there's an increase in requests, and it's a curious one as it's almost all HEAD reqs
[23:08:52] to cache_upload even
[23:08:59] it's too soon for the data to appear in the IX dashboard
[23:09:04] see here:
[23:09:06] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&from=now-1h&to=now&var-site=codfw&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
[23:09:09] https://ix.equinix.com/ixp/trafficStats
[23:09:19] gonna push ulsfo first, then look at that
[23:09:37] (03CR) 10BBlack: [C: 032] Revert "Depool ulsfo for DC move" [dns] - 10https://gerrit.wikimedia.org/r/464922 (owner: 10BBlack)
[23:10:15] https://librenms.wikimedia.org/graphs/to=1538780700/id=16721/type=port_bits/from=1538694300/
[23:10:46] if we disable the IX port the spike will impact one of our uplinks...
[23:11:02] right
[23:11:17] so no easy way to get out other than banning those specific source IPs after identifying them
[23:11:19] I'm surprised nothing else is alerting on this, really, but I guess it can make sense
[23:13:07] I'm assuming with that flattening at ~9.5G we're basically fully saturated
[23:14:21] yeah the ix port is at full steam
[23:14:40] it's facebook traffic heh
[23:14:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:15:03] with new UAs: Go-http-client/2.0 + Go-http-client/1.1
[23:15:37] the client IPs are all over, but they all put FB's standard identifier stuff in the lower 64
[23:16:30] "2a03:2880:ff:2::face:b00c"
[23:16:37] and similar with the :2: changing variously
[23:16:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:17:12] who at fb does a prod change Friday evening...
[23:17:36] yeah go figure
[23:17:49] this seems eerily familiar to the incident like this we had with alexa :P
[23:18:16] ulsfo is starting to offload some of the traffic, but I donno if it's enough to really fix things
[23:19:03] another thing we could do, is repool eqiad's front edge too
[23:19:26] (the reqs would still flow eqiad->codfw before reaching into the applayer, and thus still respect the DC switch in that sense, but it would shift edge link traffic around)
[23:19:35] bblack: set a larger cache value for this specific new UA?
[23:19:53] no idea if that will even help, or if they're even sending duplicate reqs
[23:19:59] but it's HEAD requests anyways, not GET
[23:20:06] so probably it wouldn't help :)
[23:21:17] rate limit is possible?
[23:21:32] at the cache layer
[23:21:51] maybe
[23:21:59] it's going to break something of course, somewhere
[23:22:06] sample of what the reqs look like in oxygen logs:
[23:22:09] {"hostname":"cp2026.codfw.wmnet","sequence":2823883557,"dt":"2018-10-05T23:12:33","time_firstbyte":0.000425,"ip":"2a03:2880:ff:b::face:b00c","cache_status":"hit-local","http_status":"404","response_size":0,"http_method":"HEAD","uri_host":"upload.wikimedia.org","uri_path":"/wikipedia/en/1/14/Iliana_Iotova_-_Bulgarian_part-_Citizens%E2%80%99_Corner_debate-_With_or_without_Schengen","uri_query":"?_(26448779462).png","content_type":"text/html; charset=UTF-8","referer":"http://upload.wikimedia.org/wikipedia/en/1/14/Iliana_Iotova_-_Bulgarian_part-_Citizens%E2%80%99_Corner_debate-_With_or_without_Schengen?_(26448779462).png","user_agent":"Go-http-client/2.0","accept_language":"-","x_analytics":"https=1;nocookies=1","range":"-","x_cache":"cp2011 hit/6, cp2026 miss"}
[23:22:21] {"hostname":"cp2017.codfw.wmnet","sequence":3016365346,"dt":"2018-10-05T23:12:33","time_firstbyte":0.000043,"ip":"2a03:2880:ff:a::face:b00c","cache_status":"int-front","http_status":"301","response_size":0,"http_method":"HEAD","uri_host":"upload.wikimedia.org","uri_path":"/wikipedia/commons/9/9f/%C3%83%8Dsis_Val%C3%A9ria_Gomes,_G%C3%B6teborg_Book_Fair_2014_2.png","uri_query":"","content_type":"-","referer":"-","user_agent":"Go-http-client/1.1","accept_language":"-","x_analytics":"-","range":"-","x_cache":"cp2017 int"}
[23:22:47] they don't even have response bodies as HEAD responses, we're just saturating at the high volume of metadata flowing heh
[23:23:10] oh wait, why is neither one of my random samples a 200? :)
[23:24:07] bblack@oxygen:~$ jq .http_status fb-headreqs |sort|uniq -c|sort -rn|head -20
[23:24:10] 5135 "301"
[23:24:13] 4426 "412"
[23:24:15] 373 "404"
[23:24:17] 292 "200"
[23:24:51] so the 412s are precondition failures, like If-None-Match mismatches or whatever, to check caching
[23:25:45] but the 301's... apparently they're hitting port 80 for those :P
[23:25:55] which is ~ half of the traffic
[23:26:21] ridiculous, but also I don't think 429 ratelimiters will be much less impactful than the redirects or 412s
[23:26:59] of course, the HEAD reqs are the most-notable thing in the traffic window, but there's just no way they account for all that port saturation
[23:27:04] 04Critical Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[23:27:08] 04Critical Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[23:27:25] there's probably also GETs for images in the mix too, just at a much lower rate
[23:27:27] ah ^
[23:27:33] recoveries
[23:28:05] 04Critical Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[23:28:10] or in other words: we're observing the HEADs in reqstats because it's a high volume of reqs, but there's probably also a low volume of matching GET traffic for images that doesn't stand out in req counts, but is contributing significantly to bytes output
[23:29:05] so I guess someone is spinning up some software and testing it for brief periods
[23:29:09] yeah, on a Friday :P
[23:29:50] don't they have better things to do, like sell user data or create new bugs that will compromise millions of their user accounts or something?
[23:33:26] lol
[23:33:36] XioNoX: my net take on this is I think we should repool the eqiad front edge. We're only ~3 work days out from switchback anyways, and it will hopefully segment away some of the competing traffic to other links.
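The jq one-liner above tallies `http_status` values across the JSON-per-line webrequest sample on oxygen. The same tally can be sketched in Python; the records below are trimmed stand-ins shaped like the oxygen samples pasted earlier, not real data:

```python
import json
from collections import Counter

# Hypothetical JSON-per-line sample, fields trimmed to what the tally needs.
sample_lines = [
    '{"http_status":"404","http_method":"HEAD","user_agent":"Go-http-client/2.0"}',
    '{"http_status":"301","http_method":"HEAD","user_agent":"Go-http-client/1.1"}',
    '{"http_status":"301","http_method":"HEAD","user_agent":"Go-http-client/2.0"}',
]

def status_counts(lines):
    """Count http_status values, like `jq .http_status | sort | uniq -c | sort -rn`."""
    return Counter(json.loads(line)["http_status"] for line in lines)

counts = status_counts(sample_lines)
for status, n in counts.most_common():
    print(n, status)
```

In practice you would feed it the log file line by line (`open("fb-headreqs")`) instead of an in-memory list.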
[23:34:04] because odds are this probably wasn't the last burst, and blocking it gets tricky and breaks other things for a bunch of other FB stuff I assume [23:42:04] (03PS1) 10BBlack: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/464928 [23:43:51] anyways, the DNS stats are an interesting indicator to look at during this stuff. I wouldn't normally look for it, but happened to have them up lately: [23:43:54] https://grafana.wikimedia.org/dashboard/db/dns?orgId=1&from=now-6h&to=now [23:44:40] because a lot of the recursive caches that hit us, are smart enough that they keep some latency/loss history/probing to all 3, so when you see a notable shift of stats away from a DC, it probably means they're observing worse conditions there in general [23:45:00] (and in this case, the bump there in that graph, away from codfw to the other two, aligns with this codfw traffic saturation) [23:46:24] cool, yeah [23:47:03] I was going to say that I'm not expecting changes on a weekend but last spike was a Saturday I think [23:47:34] so repooling eqiad is maybe safer [23:48:52] (03CR) 10BBlack: [C: 032] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/464928 (owner: 10BBlack) [23:50:12] !log <<<<<<< repooling eqiad edge caches, a few days ahead of intended switchback next Weds, to alleviate some traffic engineering concerns over the weekend >>>>>> [23:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
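Earlier in the log, the crawler traffic was identified by Facebook's conventional ::face:b00c suffix in the low bits of otherwise-varying IPv6 client addresses. That observation can be turned into a small filter; this is a sketch, and both the helper name and the choice to match only the low 32 bits are assumptions here, not an official FB convention spec:

```python
import ipaddress

# Assumed marker: the low 32 bits spell 0xfaceb00c, as seen in the sampled IPs.
FACEB00C = 0xFACEB00C

def looks_like_fb_crawler(ip_str):
    """Return True if ip_str is an IPv6 address whose low 32 bits are face:b00c.

    Hypothetical helper mirroring the log's observation that client IPs
    varied in the upper bits but shared the ::face:b00c tail.
    """
    try:
        addr = ipaddress.ip_address(ip_str)
    except ValueError:
        return False  # not an IP address at all
    if addr.version != 6:
        return False
    return int(addr) & 0xFFFFFFFF == FACEB00C
```

For example, `looks_like_fb_crawler("2a03:2880:ff:b::face:b00c")` matches the sampled oxygen records, while an unrelated IPv6 address does not.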