[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191206T0000).
[00:00:04] <jouncebot>	 urandom and Zoranzoki21: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:12] <urandom>	 o/
[00:00:21] <Zoranzoki21>	 \o
[00:02:11] <urandom>	 I have a request for whoever is handling SWAT: Prior to deploying r554910, I need an entry added to /srv/deployment/mediawiki/mediawiki/private/PrivateSettings.php on the deploy host.  I need $wmgSessionStoreHMACKey set to something secret-y (long(ish) and random(ish))
[00:02:33] <urandom>	  I don't have perms to edit it.
[00:03:34] <Zoranzoki21>	 RoanKattouw: can you SWAT?
[00:03:42] <RoanKattouw>	 Sure
[00:03:53] <Zoranzoki21>	 And can you deploy my patch first? It doesn't need mwdebug.
[00:04:29] <wikibugs>	 (CR) Catrope: [C: +2] Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476) (owner: Zoranzoki21)
[00:04:56] <wikibugs>	 (PS3) Zoranzoki21: Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476)
[00:05:04] <Zoranzoki21>	 Oh, it needed rebase
[00:05:10] <Zoranzoki21>	 Can you reapply +2?
[00:05:16] <wikibugs>	 (CR) Catrope: [C: +2] Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476) (owner: Zoranzoki21)
[00:06:08] <wikibugs>	 (Merged) jenkins-bot: Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476) (owner: Zoranzoki21)
[00:08:38] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add *.archives.go.jp to $wgCopyUploadsDomains (T238476) (duration: 01m 00s)
[00:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:45] <stashbot>	 T238476: Add *.archives.go.jp to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T238476
[00:08:57] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul)
[00:09:24] <Zoranzoki21>	 Ty so much RoanKattouw
[00:11:27] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (Papaul)
[00:14:15] <urandom>	 RoanKattouw: are you going to have time to do the other patch?  If so, did you see my earlier message about PrivateSettings.php?
[00:14:38] <wikibugs>	 Operations, Epic, Maps (Kartotherian), Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (Jdforrester-WMF) >>! In T216826#5640424, @MSantos wrote: > @Mathew.onipe and @Jdforrester-WMF just FYI: I have tested kartotherian with debian buster...
[00:14:45] <icinga-wm>	 RECOVERY - Old JVM GC check - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[00:15:10] <wikibugs>	 (PS1) Papaul: DNS: Remove mgmt DNS for db2062 and db2067 [dns] - https://gerrit.wikimedia.org/r/554972
[00:16:56] <wikibugs>	 (PS3) Papaul: DNS: Add mgmt and production DNS for frdb2002 [dns] - https://gerrit.wikimedia.org/r/554640
[00:17:29] <wikibugs>	 (CR) Papaul: [C: +2] DNS: Add mgmt and production DNS for frdb2002 [dns] - https://gerrit.wikimedia.org/r/554640 (owner: Papaul)
[00:19:56] <wikibugs>	 (PS2) Papaul: DNS: Remove mgmt DNS for db2062 and db2067 [dns] - https://gerrit.wikimedia.org/r/554972
[00:20:21] <wikibugs>	 (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for db2062 and db2067 [dns] - https://gerrit.wikimedia.org/r/554972 (owner: Papaul)
[00:21:21] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul)
[00:21:37] <wikibugs>	 Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Papaul)
[00:21:39] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul) Open→Resolved Complete
[00:22:13] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (Papaul)
[00:22:44] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (Papaul) Open→Resolved Complete
[00:22:46] <wikibugs>	 Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Papaul)
[00:23:19] <RoanKattouw>	 urandom: Sorry, got distracted with something else, back here now
[00:23:32] <RoanKattouw>	 Yes I saw. How would you like me to generate that value?
[00:23:39] <wikibugs>	 Operations, Epic, Maps (Kartotherian), Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (Mholloway) >>! In T216826#5640424, @MSantos wrote: > @Mathew.onipe and @Jdforrester-WMF just FYI: I have tested kartotherian with debian buster and u...
[00:24:05] <RoanKattouw>	 Or maybe you could generate it yourself and put it in your homedir or another place on the deployment host where I can see it?
[00:25:01] <wikibugs>	 (CR) Catrope: [C: +2] Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: Eevans)
[00:25:05] <urandom>	 I don't think it matters a lot so long as it's reasonably unguessable
[00:25:09] <urandom>	 RoanKattouw: date +%s | sha256sum | base64 | head -c 64 ; echo
[00:25:10] <urandom>	 ?
[00:25:49] <wikibugs>	 (Merged) jenkins-bot: Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: Eevans)
[00:25:50] <Reedy>	 Well, if I know what time it gets run.. Or around that time...
[00:26:00] <Reedy>	 Wouldn't take much brute forcing...
[00:26:02] <urandom>	 </dev/urandom tr -dc '12345!@#$%qwertQWERTasdfgASDFGzxcvbZXCVB' | head -c64; echo ""
[00:26:33] <RoanKattouw>	 OK that works, thank you
[00:26:36] <RoanKattouw>	 Sorry for being lazy
[00:26:41] <urandom>	 no worries
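A minimal Python sketch of Reedy's objection to the timestamp-derived key, under stated assumptions: `timestamp_key` mimics the `date +%s | sha256sum | base64 | head -c 64` pipeline (assuming GNU `sha256sum` output of the form `<hex>  -`), `brute_force` is a hypothetical search over a time window, and `random_key` is an illustrative alternative that draws from the OS CSPRNG via the standard `secrets` module, similar in spirit to the `/dev/urandom` pipeline. None of these names come from the log.

```python
import base64
import hashlib
import secrets
import string


def timestamp_key(ts: int) -> str:
    """Mimic `date +%s | sha256sum | base64 | head -c 64` for a given epoch second."""
    # `date +%s` prints the timestamp plus a newline; sha256sum prints "<hex>  -\n".
    digest_line = hashlib.sha256(f"{ts}\n".encode()).hexdigest() + "  -\n"
    return base64.b64encode(digest_line.encode()).decode()[:64]


def brute_force(key: str, around: int, window: int = 3600):
    """Try every second within +/- window of a guessed time; return the match or None."""
    for ts in range(around - window, around + window + 1):
        if timestamp_key(ts) == key:
            return ts
    return None


def random_key(length: int = 64,
               alphabet: str = string.ascii_letters + string.digits) -> str:
    """Draw each character independently from the OS CSPRNG."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    weak = timestamp_key(1575591909)
    # An attacker who only knows the rough time needs at most 7201 guesses here.
    print(brute_force(weak, around=1575591000))
    print(random_key())
```

The search space of the first scheme is the attacker's uncertainty about the run time in seconds, whereas a 64-character key over a 62-symbol alphabet cannot be enumerated in practice.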
[00:26:55] <Reedy>	 00:26:28 00:26:28 scap failed: RuntimeError Scap failed!: Call to mwscript eval.php stderr: Notice: Undefined variable: wmgSessionStoreHMACKey in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 474 (duration: 00m 00s)
[00:27:00] <Reedy>	 Beta is unhappy :P
[00:27:15] <urandom>	 ls
[00:27:17] <urandom>	 crap
[00:27:44] <urandom>	 I guess a value needs to be added to deployment-deploy01.deployment-prep.eqiad.wmflabs too?
[00:27:45] <Reedy>	 But beta is easily fixed
[00:27:50] <Reedy>	 Yeah exactly
[00:28:30] <urandom>	 I can read these files, but not write to them
[00:29:36] <Reedy>	 I'll fix beta
[00:30:51] <Reedy>	 ok, value staged
[00:30:58] <Reedy>	 jenkins next run will fix beta scap
[00:31:09] <urandom>	 thanks!
[00:36:30] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (Papaul) papaul@asw2-a-eqiad# show | compare  [edit interfaces] -   ge-8/0/8 { -       description labstore1003; -   }
[00:41:39] <RoanKattouw>	 Oh whoops
[00:41:41] <RoanKattouw>	 yes thanks Reedy 
[00:42:16] <RoanKattouw>	 OK I've pulled this onto mwdebug1001
[00:42:32] <wikibugs>	 (CR) Reedy: ".gitattributes is in the next release!" [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester)
[00:42:42] <urandom>	 RoanKattouw: ??
[00:43:30] <RoanKattouw>	 urandom: The test server
[00:43:52] <urandom>	 the key doesn't need to be on deploy1001?
[00:44:00] * urandom had been watching that file
[00:44:17] <Reedy>	 It presumably is
[00:44:29] <Reedy>	 But Roan has pulled it onto the test server for.. testing?
[00:44:39] <Reedy>	 Dunno if there's anything you can actually test though :D
[00:44:40] <urandom>	 eevans@deploy1001:/srv/deployment/mediawiki/mediawiki/private$ grep -c wmgSessionStoreHMACKey PrivateSettings.php
[00:44:40] <urandom>	 0
[00:44:53] <RoanKattouw>	 urandom: I put it in /srv/deployment/mediawiki-staging
[00:44:59] <RoanKattouw>	 Which is where things get synced from
[00:45:06] <Reedy>	 It'll be in /srv/deployment when it's sync'd everywhere
[00:45:13] <RoanKattouw>	 My session is still working on enwiki using the test server
[00:45:40] <RoanKattouw>	 Not seeing any errors in logstash or when running eval.php, and if I dump out the kask config I see the right secret value
[00:45:42] <RoanKattouw>	 So, let's roll
[00:46:02] <urandom>	 RoanKattouw: this is only deployed to testwiki ATM, but it looks good there, too
[00:46:33] <icinga-wm>	 RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[00:47:15] <urandom>	 where is /srv/deployment/mediawiki-staging ?
[00:47:25] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Use PHP serialization with HMAC for Kask session serialization (T222099) (duration: 01m 01s)
[00:47:28] <urandom>	 for my own edification
[00:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:32] <stashbot>	 T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099
[00:48:05] <Reedy>	 it's /srv/mediawiki-staging
[00:48:09] <RoanKattouw>	 Oh it's /srv/mediawiki-staging, my apologies
[00:48:35] <Reedy>	 /srv/deployment/mediawiki/mediawiki is a symlink to /srv/mediawiki
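The path confusion above (a staging tree, a live tree, and a symlink that points at the live tree) can be checked mechanically. A small Python sketch, assuming only the layout Reedy describes; `build_demo_tree` and `same_target` are hypothetical helpers that recreate that layout under a temporary directory rather than touching a real deploy host.

```python
import os
import tempfile


def build_demo_tree(root: str) -> dict:
    """Recreate the described layout under `root` (a scratch directory)."""
    live = os.path.join(root, "srv", "mediawiki")
    staging = os.path.join(root, "srv", "mediawiki-staging")
    link_parent = os.path.join(root, "srv", "deployment", "mediawiki")
    for path in (live, staging, link_parent):
        os.makedirs(path)
    link = os.path.join(link_parent, "mediawiki")
    os.symlink(live, link)  # /srv/deployment/mediawiki/mediawiki -> /srv/mediawiki
    return {"live": live, "staging": staging, "link": link}


def same_target(a: str, b: str) -> bool:
    """True if both paths resolve to the same real location."""
    return os.path.realpath(a) == os.path.realpath(b)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        paths = build_demo_tree(root)
        print(same_target(paths["link"], paths["live"]))
        print(same_target(paths["link"], paths["staging"]))
```

Under this layout, a file edited in staging is not visible through the symlinked path until something copies it to the live tree, which matches what urandom observed with `grep`.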
[00:52:05] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7000 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:52:15] <wikibugs>	 (CR) Jforrester: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester)
[00:53:20] <wikibugs>	 (CR) Reedy: "You can still use git archive or similar which will follow gitattributes" [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester)
[00:53:53] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 10 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:57:28] <urandom>	 RoanKattouw: LGTM; Thanks!
[00:58:01] <urandom>	 FYI, I still don't see the value in /srv/deployment/mediawiki-staging/PrivateSettings.php
[00:58:28] <Reedy>	 reedy@deploy1001:/srv/deployment/mediawiki$ ls -al /srv/deployment/mediawiki-staging
[00:58:28] <Reedy>	 ls: cannot access '/srv/deployment/mediawiki-staging': No such file or directory
[00:58:33] <Reedy>	 remove the deployment
[00:58:43] <urandom>	 yeah, sorry, mispaste
[00:58:59] <urandom>	 I meant /srv/mediawiki/private
[00:59:24] <urandom>	 eevans@deploy1001:/srv/mediawiki/private$ grep wmgSessionStoreHMACKey PrivateSettings.php
[00:59:24] <urandom>	 eevans@deploy1001:/srv/mediawiki/private$
[01:00:07] <Reedy>	 The file hasn't been sync'd yet
[01:00:26] <urandom>	 what syncs it?
[01:00:30] <Reedy>	 The deployer
[01:00:33] <urandom>	 oh
[01:00:38] <urandom>	 I thought that part was done
[01:00:46] <Reedy>	 RoanKattouw: Are you sync-file-ing PrivateSettings too? :P
[01:01:09] <Reedy>	 He's been idle for 15 minutes according to the server..
[01:01:36] <urandom>	 his last was "So lets roll", which made me think we were rolling :)
[01:02:26] <wikibugs>	 (PS4) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967
[01:02:28] <wikibugs>	 (PS1) Jforrester: Variant configuration: Read and write variant config from conf-dir, not /tmp [mediawiki-config] - https://gerrit.wikimedia.org/r/554977
[01:02:30] <wikibugs>	 (PS1) Jforrester: Stop setting wgSpamBlacklistEventLogging, no longer read [mediawiki-config] - https://gerrit.wikimedia.org/r/554978
[01:02:31] <Reedy>	 I'm slightly confused as he should've really sync'd that file before he did CommonSettings.php
[01:02:32] <wikibugs>	 (PS1) Jforrester: Drop wgMediaInfoEnableOtherStatements and wgDepictsQualifierProperties, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/554979
[01:02:34] <wikibugs>	 (PS1) Jforrester: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/554980
[01:02:40] <RoanKattouw>	 Oh uhm
[01:02:41] <James_F>	 I just shouted at RoanKattouw IRL.
[01:02:43] <RoanKattouw>	 Yes I really should have, yikes
[01:02:44] <logmsgbot>	 !log reedy@deploy1001 Synchronized private/PrivateSettings.php: wmgSessionStoreHMACKey T222099 (duration: 01m 07s)
[01:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:50] <stashbot>	 T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099
[01:03:00] <Reedy>	 I just did it with the idle time :P
[01:03:37] <Reedy>	 I'm guessing there was a spam of errors in logstash relating to it being undefined.. Luckily only on testwiki so should've been minimal
[01:03:57] <RoanKattouw>	 Oh lol I just did it and didn't hit the lock because yours finished right before mine started
[01:03:58] <wikibugs>	 (PS2) Jforrester: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/554980
[01:04:09] <logmsgbot>	 !log catrope@deploy1001 Synchronized private/PrivateSettings.php: HMAC value for Kask config (T222099) (duration: 00m 59s)
[01:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:23] <RoanKattouw>	 Sorry for being so scatterbrained today
[01:04:25] <Reedy>	 :)
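RoanKattouw's remark about not hitting the lock describes mutual exclusion between two deployers: the second sync only proceeds because the first had already released it. The log does not show how the real tooling implements this, so the Python below is a generic sketch of one common approach (an exclusively created lock file); `deploy_lock`, `LockHeld`, and the lock path are all hypothetical names.

```python
import os
import tempfile
from contextlib import contextmanager


class LockHeld(Exception):
    """Raised when another deployer already holds the lock."""


@contextmanager
def deploy_lock(path: str, owner: str):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeld(f"{path} is held by another deployer") from None
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(owner)
        yield
    finally:
        os.unlink(path)  # release the lock even if the sync fails


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "sync.lock")
    with deploy_lock(lock_path, "first-deployer"):
        try:
            with deploy_lock(lock_path, "second-deployer"):
                pass
        except LockHeld as exc:
            print("blocked:", exc)
    # Once the first sync has finished, the second one goes through.
    with deploy_lock(lock_path, "second-deployer"):
        print("second sync running")
```

A lock like this serializes the two syncs but does nothing about their order, which is why the same file could be synchronized twice in a row without either deployer noticing.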
[01:04:55] <Reedy>	 only 45K errors in logstash apparently :P
[01:04:59] <wikibugs>	 (PS2) Jforrester: Stop setting wgSpamBlacklistEventLogging, no longer read [mediawiki-config] - https://gerrit.wikimedia.org/r/554978
[01:05:01] <wikibugs>	 (PS2) Jforrester: Drop wgMediaInfoEnableOtherStatements and wgDepictsQualifierProperties, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/554979
[01:05:03] <wikibugs>	 (PS3) Jforrester: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/554980
[01:05:05] <wikibugs>	 (PS5) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967
[01:05:07] <wikibugs>	 (PS2) Jforrester: Variant configuration: Read and write variant config from conf-dir, not /tmp [mediawiki-config] - https://gerrit.wikimedia.org/r/554977
[01:05:38] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource
[01:05:38] <icinga-wm>	 /ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:06:22] <Reedy>	 And they stopped when the file was sync'd
[01:06:25] <Reedy>	 So should be all good now
[01:06:29] <urandom>	 \o/
[01:06:40] <Reedy>	 Oh, that isn't guarded
[01:07:00] <Reedy>	 So... yeah, it'll have been an undefined on every execution...
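The failure discussed above came from syncing a config file that reads `$wmgSessionStoreHMACKey` before the private file that defines it. A pre-sync check along the following lines could flag that ordering problem. This is a Python sketch under loud assumptions: `undefined_settings` is a hypothetical helper, the regexes only approximate PHP syntax, and the sample config text is made up for illustration (only the variable name comes from the log).

```python
import re

# Any use of a $wmg... variable, and any line-leading assignment to one.
REFERENCE = re.compile(r"\$(wmg\w+)")
DEFINITION = re.compile(r"^\s*\$(wmg\w+)\s*=[^=]", re.MULTILINE)


def undefined_settings(config_src: str, private_src: str) -> list:
    """Return $wmg* names that config_src uses but neither source assigns."""
    defined = set(DEFINITION.findall(config_src)) | set(DEFINITION.findall(private_src))
    referenced = set(REFERENCE.findall(config_src))
    return sorted(referenced - defined)


if __name__ == "__main__":
    # Invented stand-ins for the public config and the private settings file.
    config = "$wgExampleStore = [ 'hmacKey' => $wmgSessionStoreHMACKey ];\n"
    private_before = "$wmgSomethingElse = 'x';\n"
    private_after = private_before + "$wmgSessionStoreHMACKey = 'not-a-real-key';\n"

    print(undefined_settings(config, private_before))
    print(undefined_settings(config, private_after))
```

A non-empty result before the sync would suggest that the private file needs to go out first, or that the read should be guarded.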
[01:08:17] <urandom>	 Reedy, RoanKattouw: thanks for the help!
[01:16:21] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:18:40] <wikibugs>	 Operations, ops-eqiad, serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (Papaul) @Jclark-ctr  Please see below for available mgmt  IP's that you can use for those servers. Once you have the asset tags please update the table with the...
[01:25:25] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/SecurePoll/cli/dump.php: T239968 (duration: 01m 01s)
[01:25:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:31] <stashbot>	 T239968: `cli/dump.php` does not accept the --votes modifier - https://phabricator.wikimedia.org/T239968
[01:34:48] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/SecurePoll/cli/dump.php: T239968 (duration: 01m 00s)
[01:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:53] <stashbot>	 T239968: `cli/dump.php` does not accept the --votes modifier - https://phabricator.wikimedia.org/T239968
[01:52:16] <wikibugs>	 (CR) Ebe123: [C: -1] Upload HD logos for aawiki, aawikibooks, aawiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[02:00:21] <wikibugs>	 (CR) Zoranzoki21: "Wikis are closed, I will do this for another 3 projects." [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[02:12:30] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/SecurePoll/cli/dump.php: T239968 (duration: 01m 04s)
[02:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:36] <stashbot>	 T239968: `cli/dump.php` does not accept the --votes modifier - https://phabricator.wikimedia.org/T239968
[02:13:11] <wikibugs>	 (PS4) Zoranzoki21: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618)
[02:13:59] <wikibugs>	 (PS5) Zoranzoki21: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618)
[03:10:06] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack codfw1dev: everything is ocata now [puppet] - https://gerrit.wikimedia.org/r/554842 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[03:34:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:37:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:53:52] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime
[03:53:55] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime
[03:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:57] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:09] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:45] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5388 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:12:07] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 18 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:24:44] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (contint1001, ...), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[04:36:12] <wikibugs>	 (PS5) DannyS712: InitialiseSettings - clean up groupOverrides layout / spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/554392 (https://phabricator.wikimedia.org/T231178)
[04:43:10] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5348 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:44:50] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:12:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:13:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:56] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:44] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:31:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:14] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:08:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:00] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] secret: dummy credentials for airflow [labs/private] - https://gerrit.wikimedia.org/r/544993 (owner: EBernhardson)
[07:11:46] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:33] <wikibugs>	 Operations, Wikimedia-IRC-RC-Server, Patch-For-Review: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271 (elukey)
[07:12:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:13:06] <wikibugs>	 Operations, Analytics, Code-Stewardship-Reviews, Tools, Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (elukey) >>! In T185319#5716701, @Dzahn wrote: > Is this really replacing the IRCd from T134271 ?  Yep! Closed it as du...
[07:15:20] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:31] <elukey>	 the eqiad - eqord link seems again under scheduled telia maintenance
[07:18:20] <elukey>	 ah and the cr3-ulsfo is to eqord as well, telia maintenance for both links
[07:21:12] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 9595 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:22:57] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 8 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:26:36] <elukey>	 lovely
[07:37:32] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1ac26da]: (no justification provided)
[07:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:39] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1ac26da]: (no justification provided) (duration: 00m 07s)
[07:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:15] <moritzm>	 !log installing libav security updates
[07:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:26] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1ac26da]: (no justification provided)
[07:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:25] <wikibugs>	 (CR) Urbanecm: [C: -1] "arbcom_fiwiki logos aren't optipng'ed. Could you run optipng -o7 on them, please?" [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[07:41:50] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1ac26da]: (no justification provided) (duration: 03m 23s)
[07:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:55] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1ac26da]: (no justification provided)
[07:41:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:02] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1ac26da]: (no justification provided) (duration: 00m 08s)
[07:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:20] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[07:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:31] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 03m 11s)
[07:46:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:36] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.55 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[07:55:29] <moritzm>	 !log installing libonig security updates
[07:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:30] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[07:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:00] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[08:01:33] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 02m 03s)
[08:01:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:56] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[08:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:24] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 01m 28s)
[08:03:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:38] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[08:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:46] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 00m 07s)
[08:04:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:49] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: add config files for 'train' [puppet] - 10https://gerrit.wikimedia.org/r/555030 (https://phabricator.wikimedia.org/T239974)
[08:20:51] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: update some horizon settings for Train [puppet] - 10https://gerrit.wikimedia.org/r/555031 (https://phabricator.wikimedia.org/T239974)
[08:20:53] <wikibugs>	 (03PS1) 10Andrew Bogott: codfw1dev: move to Horizon version 'train' [puppet] - 10https://gerrit.wikimedia.org/r/555032 (https://phabricator.wikimedia.org/T239974)
[08:21:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: add config files for 'train' [puppet] - 10https://gerrit.wikimedia.org/r/555030 (https://phabricator.wikimedia.org/T239974) (owner: 10Andrew Bogott)
[08:22:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update some horizon settings for Train [puppet] - 10https://gerrit.wikimedia.org/r/555031 (https://phabricator.wikimedia.org/T239974) (owner: 10Andrew Bogott)
[08:22:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move to Horizon version 'train' [puppet] - 10https://gerrit.wikimedia.org/r/555032 (https://phabricator.wikimedia.org/T239974) (owner: 10Andrew Bogott)
[08:24:46] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:30] <moritzm>	 !log installing libgd2 security updates on stretch
[08:25:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:52] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:30:20] <wikibugs>	 (03PS1) 10Andrew Bogott: Add openstack client packages for train [puppet] - 10https://gerrit.wikimedia.org/r/555080
[08:30:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add openstack client packages for train [puppet] - 10https://gerrit.wikimedia.org/r/555080 (owner: 10Andrew Bogott)
[08:34:24] <wikibugs>	 (03PS2) 10Andrew Bogott: Add openstack client packages for train [puppet] - 10https://gerrit.wikimedia.org/r/555080
[08:35:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add openstack client packages for train [puppet] - 10https://gerrit.wikimedia.org/r/555080 (owner: 10Andrew Bogott)
[08:36:35] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[08:36:36] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:20] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:38:20] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/553369 (https://phabricator.wikimedia.org/T236017) (owner: 10Giuseppe Lavagetto)
[08:38:30] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 01m 55s)
[08:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:41] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: add 'train' versions of designate and neutron policy.json [puppet] - 10https://gerrit.wikimedia.org/r/555258 (https://phabricator.wikimedia.org/T239974)
[08:40:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: add 'train' versions of designate and neutron policy.json [puppet] - 10https://gerrit.wikimedia.org/r/555258 (https://phabricator.wikimedia.org/T239974) (owner: 10Andrew Bogott)
[08:41:32] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[08:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:31] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 01m 59s)
[08:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:32] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[08:46:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:40] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 00m 08s)
[08:46:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:13] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[09:07:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (10Mathew.onipe) 05Open→03Resolved
[09:08:44] <wikibugs>	 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[09:20:45] <wikibugs>	 10Operations, 10Packaging, 10serviceops: Build and upload envoy 1.12.0 package. - https://phabricator.wikimedia.org/T237235 (10Joe) 05Open→03Resolved
[09:20:48] <wikibugs>	 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe)
[09:21:21] <wikibugs>	 10Operations, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10Joe)
[09:23:22] <wikibugs>	 (03PS1) 10Andrew Bogott: remove a dangling comma [labs/private] - 10https://gerrit.wikimedia.org/r/555276
[09:24:28] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] remove a dangling comma [labs/private] - 10https://gerrit.wikimedia.org/r/555276 (owner: 10Andrew Bogott)
[09:26:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: blubberoid: break tls functionality into a helper (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554832 (owner: 10Giuseppe Lavagetto)
[09:26:13] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: blubberoid: break TLS functionality into a helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/554832 (https://phabricator.wikimedia.org/T235411)
[09:26:15] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: scaffold: import the blubberoid tls helpers in scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411)
[09:26:17] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: eventgate: convert to use the common tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/554834 (https://phabricator.wikimedia.org/T235411)
[09:37:02] <wikibugs>	 10Operations, 10Maps: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (10Mathew.onipe)
[09:39:46] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:40:30] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:40:41] <dcausse>	 this is me ^
[09:41:06] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:42:36] <wikibugs>	 (03PS3) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832)
[09:43:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: 10Muehlenhoff)
[09:49:12] <wikibugs>	 (03PS4) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832)
[09:54:00] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:54:44] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:55:20] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:02:24] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:02:30] <wikibugs>	 (03PS5) 10Hashar: contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943)
[10:02:52] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:03:23] <gehel>	 onimisionipe: is it you with wdqs1010 ?
[10:03:30] <gehel>	 Or dcausse ?
[10:03:34] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:04:37] <dcausse>	 gehel: it's me
[10:04:46] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) As a way to identify more specifically where the TTFB regression comes from, in particular to understand precisely how much ats-be co...
[10:05:06] <dcausse>	 I need to stop blazegraph to run some journal tools
[10:05:56] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:06:24] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:06:34] <gehel>	 dcausse: ok, I'll downtime
[10:07:02] <dcausse>	 thanks
[10:07:06] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:18:58] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Could you generate a separate stats table for misses and passthroughs?
[10:24:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) (owner: 10Hashar)
[10:25:08] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (10hashar) > pristine-tar: delta is version 3, newer than maximum supported version 2  @ema the CI debian-glue jobs are now running on Buster instances and thus come wit...
[10:25:25] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5717652, @Gilles wrote: > Could you generate a separate stats table for misses and passthroughs?  Certainly. Non-hits...
[10:27:22] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthroughs (ATS/Varnish merely acted as a pro...
[10:27:31] <wikibugs>	 (03PS1) 10Elukey: Add fake keytab for analytics-search on stat1007 [labs/private] - 10https://gerrit.wikimedia.org/r/555360
[10:27:54] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: prometheus::k8s: drop envoy metrics about the admin interface [puppet] - 10https://gerrit.wikimedia.org/r/553246
[10:27:56] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytab for analytics-search on stat1007 [labs/private] - 10https://gerrit.wikimedia.org/r/555360 (owner: 10Elukey)
[10:36:08] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[10:37:56] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[10:38:36] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5717657, @Gilles wrote: > I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthrou...
[10:41:51] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Yes, from a cache application perspective they are different tasks and therefore the issues affecting each could have different ca...
[10:59:21] <wikibugs>	 (03PS4) 10Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528
[11:00:42] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5292 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:02:26] <wikibugs>	 (03PS5) 10Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528
[11:04:16] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:08:22] <wikibugs>	 (03PS1) 10Ema: ATS: mark uncacheable responses as 'pass' in X-Cache-Int [puppet] - 10https://gerrit.wikimedia.org/r/555396 (https://phabricator.wikimedia.org/T227432)
[11:08:47] <wikibugs>	 (03CR) 10Elukey: "Sent another version, I realized that I was missing some stuff, the code was not correct." [puppet] - 10https://gerrit.wikimedia.org/r/554528 (owner: 10Elukey)
[11:10:06] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[11:11:54] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:21:30] <wikibugs>	 (03PS6) 10Zoranzoki21: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618)
[11:21:49] <wikibugs>	 (03CR) 10Zoranzoki21: "> arbcom_fiwiki logos aren't optipng'ed. Could you run optipng -o7 on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21)
[11:46:52] <wikibugs>	 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10elukey) p:05Triage→03High
[11:48:53] <wikibugs>	 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10elukey)
[12:07:56] <icinga-wm>	 PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[12:18:53] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Add latest jenkins debian packages to apt.wikimedia.org and upgrade jenkins to latest LTS (2.190.3) - https://phabricator.wikimedia.org/T239586 (10hashar) Thanks for the new Jenkins packages :]  Fo...
[12:27:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21)
[12:52:34] <wikibugs>	 10Operations, 10Epic, 10Maps (Kartotherian), 10Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10MSantos) >>! In T216826#5717180, @Jdforrester-WMF wrote: >>>! In T216826#5640424, @MSantos wrote: >> @Mathew.onipe and @Jdforrester-WMF just FYI: I h...
[12:59:48] <wikibugs>	 (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) (owner: 10KartikMistry)
[13:15:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "Hmm those machines have 2 disks after all. I would swear I thought they had 4. Anyway, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554961 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron)
[13:23:46] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170 (10BBlack) 05Open→03Resolved a:03BBlack I'm not sure how long it's been fixed in our infra, but it definitely works correctly now in our new...
[13:31:19] <gehel>	 !log starting transfer of blazegraph journal from wdqs1007 to stat1004 - T239898
[13:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:26] <stashbot>	 T239898: Investigate triple counts difference between dumps and what blazegraph reports - https://phabricator.wikimedia.org/T239898
[13:32:10] <wikibugs>	 (03PS3) 10Ema: ATS: pass uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/553132 (https://phabricator.wikimedia.org/T238494)
[13:34:54] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: pass uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/553132 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema)
[13:35:04] <icinga-wm>	 PROBLEM - traffic_server tls process restarted on cp3064 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3064&var-layer=tls
[13:41:47] <ema>	 !log cp2004: adding do_global_ doesn't seem to work with reload, restart ats-be T238494
[13:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:53] <stashbot>	 T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[13:44:20] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (10Jclark-ctr) Closing as a duplicate.
[13:52:01] <wikibugs>	 10Operations, 10Traffic: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498 (10BBlack) In these past couple of weeks we've had a real about-face on this issue, and I think there's a pretty strong consensus and rationale to pursue some kind of host-level caching, but there...
[13:54:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] blubberoid: break TLS functionality into a helper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554832 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto)
[14:00:49] <wikibugs>	 10Operations, 10Traffic: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BBlack)
[14:01:11] <wikibugs>	 10Operations, 10Pybal, 10Traffic: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10BBlack) 05Open→03Declined These are still present AFAIK, and we're fairly certain it's just due to pybal healthchecks using blank/broken TCP connections to monitor them.  That will be cleaned up...
[14:01:37] <wikibugs>	 (03Abandoned) 10Ema: ATS: mark uncacheable responses as 'pass' in X-Cache-Int [puppet] - 10https://gerrit.wikimedia.org/r/555396 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema)
[14:02:46] <wikibugs>	 10Operations, 10Traffic: Make authdns-update compatible with local emergency changes - https://phabricator.wikimedia.org/T219400 (10BBlack) Sorry, I hadn't remembered we had this existing ticket.  Will merge into the other newer one since it has patches already and some deeper context, and copy the main text over.
[14:03:53] <wikibugs>	 10Operations, 10Traffic: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (10BBlack)
[14:03:55] <wikibugs>	 10Operations, 10Traffic: Make authdns-update compatible with local emergency changes - https://phabricator.wikimedia.org/T219400 (10BBlack)
[14:04:29] <wikibugs>	 10Operations, 10Traffic: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (10BBlack) Thoughts from the main text of the merged ticket: ------------  We should improve our current [1] support of deploying an emergency DNS change when other dependent services...
[14:06:26] <wikibugs>	 10Operations, 10DNS, 10SRE-tools, 10Traffic: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (10BBlack) 05Open→03Declined Declined in favor of netbox integration (  T233183 ? ) making this problem go away.
[14:08:28] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10Core Platform Team Legacy (Watching / External), 10Services (watching): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818 (10BBlack)
[14:09:04] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10serviceops, and 2 others: icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818 (10BBlack)
[14:09:13] <wikibugs>	 (03CR) 10Herron: [C: 03+2] install_server: switch ganeti[345]* to raid1 layout [puppet] - 10https://gerrit.wikimedia.org/r/554961 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron)
[14:11:34] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:11:45] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10serviceops, and 2 others: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818 (10BBlack)
[14:12:49] <ema>	 !log cp3050: ats-backend-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/553132/ T238494
[14:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:55] <stashbot>	 T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[14:13:20] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:13:29] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10serviceops, and 2 others: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818 (10BBlack) While we'll work on improvements that make this less-likely i...
[14:16:09] <wikibugs>	 10Operations, 10DNS, 10Traffic: Consider DNSSec - https://phabricator.wikimedia.org/T26413 (10BBlack) Since we haven't updated this in two years, I figured I should post again:  * DNSSEC is still awful * DNSSEC is still basically all the world has to solve certain problems, for better or worse. * DNSSEC has...
[14:20:28] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:21:01] <wikibugs>	 10Operations, 10Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (10BBlack)
[14:21:40] <wikibugs>	 10Operations, 10Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (10BBlack) a:05faidon→03None
[14:22:16] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:23:12] <wikibugs>	 10Operations, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10BBlack)
[14:23:15] <wikibugs>	 10Operations, 10Traffic: Implement GeoDNS smooth repooling in gdnsd - https://phabricator.wikimedia.org/T228678 (10BBlack)
[14:23:18] <wikibugs>	 10Operations, 10Traffic: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525 (10BBlack)
[14:23:30] <wikibugs>	 10Operations, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10BBlack) This is still something we want to pursue, but we really need to get past the smooth repooling issue first, so I've added that as a subtask (consider it blocking this one).
[14:24:31] <wikibugs>	 10Operations, 10Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (10BBlack)
[14:24:37] <wikibugs>	 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack)
[14:27:49] <wikibugs>	 10Operations, 10Traffic: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) p:05Triage→03Normal
[14:29:06] <wikibugs>	 (03CR) 10Hashar: "From a discussion with Moritz:  the resulting doxygen package will be used in a Docker container run by a CI Job.  The resulting artifact " [debs/doxygen] (debian/buster-backports) - 10https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: 10Hashar)
[14:29:26] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:31:14] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:37:46] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.167 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[14:38:20] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:40:06] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:40:11] <ema>	 !log text@esams: rolling ats-backend-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/553132/ T238494
[14:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:18] <stashbot>	 T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[14:47:10] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:48:56] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:52:50] <wikibugs>	 (03CR) 10Jgreen: [C: 03+1] frack: fix asset tag management records [dns] - 10https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans)
[14:55:30] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.021 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[14:56:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:57:48] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:04:24] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.025 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:04:56] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[15:06:42] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:12:09] <elukey>	 Krinkle, AaronSchulz - o/ are you around?
[15:13:16] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.008 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:16:21] <wikibugs>	 (03PS1) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513
[15:18:36] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:19:03] <wikibugs>	 (03PS2) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513
[15:21:48] <wikibugs>	 (03PS4) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978)
[15:22:48] <wikibugs>	 (03CR) 10Muehlenhoff: "Great review, thanks! I've made a PS4, comments inline" (0326 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff)
[15:23:56] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:24:26] <wikibugs>	 (03PS3) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513
[15:24:30] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[15:25:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis)
[15:25:31] <wikibugs>	 10Operations, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10Bstorm) It looks like the limit was last raised 5 years ago. I'll double check a couple things, but I suspect that's old stuff we can raise.
[15:26:16] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:27:10] <wikibugs>	 (03PS1) 10RLazarus: Refactor, preparatory to testing multiple hosts in parallel. [software/httpbb] - 10https://gerrit.wikimedia.org/r/555515
[15:27:30] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:28:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Refactor, preparatory to testing multiple hosts in parallel. [software/httpbb] - 10https://gerrit.wikimedia.org/r/555515 (owner: 10RLazarus)
[15:28:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "This version looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi)
[15:29:50] <wikibugs>	 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) a:05Papaul→03Jgreen @Jgreen  all yours
[15:32:50] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 2.108 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:33:24] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[15:35:10] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:35:39] <wikibugs>	 (03PS2) 10RLazarus: Refactor, preparatory to testing multiple hosts in parallel. [software/httpbb] - 10https://gerrit.wikimedia.org/r/555515
[15:36:35] <wikibugs>	 (03CR) 10Muehlenhoff: "A few comments inline. @Luca, let me know if you disagree with my comment on the krb* hosts." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi)
[15:38:10] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:38:31] <wikibugs>	 (03PS4) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513
[15:39:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis)
[15:39:43] <wikibugs>	 (03PS5) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513
[15:41:48] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8333 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:42:20] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[15:42:25] <wikibugs>	 (03PS6) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513
[15:43:30] <wikibugs>	 (03CR) 10CDanis: "PCC looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis)
[15:45:52] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:46:30] <rlazarus>	 these spikes are so excessively weird
[15:46:53] <cdanis>	 rlazarus: it seems like el.ukey is onto the cause of these ones
[15:47:14] <rlazarus>	 yeah, reading
[15:56:58] <wikibugs>	 (03PS1) 10BBlack: lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993)
[15:57:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack)
[15:57:48] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[15:58:45] <bblack>	 gee thanks zone_validator :P
[15:59:19] <Vermont>	 yeah...is anything down?
[16:04:08] <wikibugs>	 (03PS2) 10BBlack: lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993)
[16:10:12] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[16:10:55] <elukey>	 Vermont: hi! Are you reporting any issue?
[16:12:11] <Vermont>	 elukey: a few minutes ago i couldn’t access WMF sites, but I could reach other sites
[16:12:15] <Vermont>	 but it isn’t an issue now
[16:12:18] <Vermont>	 thx :)
[16:12:34] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[16:12:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:13:17] <elukey>	 Vermont: we are experiencing some latency issues with the MediaWiki API appservers, so something is ongoing but it shouldn't affect all wikis that heavily
[16:13:51] <wikibugs>	 (03PS1) 10Jhedden: ceph: remove rook.io based ceph modules [puppet] - 10https://gerrit.wikimedia.org/r/555528 (https://phabricator.wikimedia.org/T236290)
[16:14:16] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:17:42] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] ceph: remove rook.io based ceph modules [puppet] - 10https://gerrit.wikimedia.org/r/555528 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden)
[16:17:54] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[16:19:06] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[16:21:50] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6390 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:22:40] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.225 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[16:23:14] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[16:23:36] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 185 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:23:46] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:23:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:25:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:25:36] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:28:41] <AaronSchulz>	 elukey: was it alert related?
[16:30:06] <elukey>	 AaronSchulz: good morning :)
[16:30:14] <elukey>	 yes! https://phabricator.wikimedia.org/T239983
[16:30:40] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Papaul)
[16:30:46] <elukey>	 AaronSchulz: currently there is one single key that is causing troubles, the one for mc1026
[16:31:34] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5667 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[16:31:38] <elukey>	 we are still not sure that this is the cause of the api latency spikes, but there is a correlation
[16:32:08] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[16:32:51] <_joe_>	 !log flushing apcu on mw1339
[16:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul) ` papaul@asw2-a-eqiad# show | compare  [edit interfaces] -   ge-1/0/5 { -       description db1069; -   }
[16:34:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul)
[16:34:50] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:35:38] <wikibugs>	 (03PS1) 10BBlack: lvs recdns decom [puppet] - 10https://gerrit.wikimedia.org/r/555537 (https://phabricator.wikimedia.org/T239993)
[16:35:40] <wikibugs>	 (03PS7) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513
[16:35:42] <wikibugs>	 (03PS1) 10BBlack: lvs recdns post-decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993)
[16:36:32] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:37:35] <wikibugs>	 (03PS1) 10BBlack: lvs recdns: get rid of legacy recursor hostnames [dns] - 10https://gerrit.wikimedia.org/r/555539 (https://phabricator.wikimedia.org/T239993)
[16:38:30] <wikibugs>	 (03CR) 10CDanis: "Whoops, had forgotten the metric value.  Fixed:" [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis)
[16:39:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul) ` papaul@asw2-b-eqiad# show | compare                              [edit interfaces interface-range disabled]      member xe-7/0/41 { ... } +    member ge-2/0/19; [edi...
[16:39:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul)
[16:40:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul)
[16:41:56] <_joe_>	 !log flush apcu across the api cluster in eqiad
[16:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:16] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 3.737 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[16:42:50] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[16:44:10] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) 05Open→03Resolved
[16:44:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul) ` papaul@asw2-b-eqiad# show | compare  [edit interfaces interface-range disabled]      member ge-2/0/19 { ... } +    member ge-3/0/26; [edit interfaces] -   ge-3/0/26...
[16:45:02] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul)
[16:45:55] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) We can now distinguish between hit, miss, and pass in text@esams ATS too.  An important caveat when looking at these numbers is that Varnish supports hit-f...
[16:47:32] <_joe_>	 !log apcu flush finished
[16:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:38] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[16:48:38] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[16:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:48] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:07] <wikibugs>	 (03PS1) 10Papaul: DNS: Remove mgmt DNS for radium,db1069,db1072 and db1073 [dns] - 10https://gerrit.wikimedia.org/r/555542
[16:51:58] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for radium,db1069,db1072 and db1073 [dns] - 10https://gerrit.wikimedia.org/r/555542 (owner: 10Papaul)
[16:53:00] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[16:53:11] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Papaul)
[16:53:28] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Papaul) 05Open→03Resolved complete
[16:53:32] <wikibugs>	 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Papaul)
[16:53:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul)
[16:53:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul) 05Open→03Resolved complete
[16:53:54] <wikibugs>	 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Papaul)
[16:54:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul)
[16:54:21] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul) 05Open→03Resolved complete
[16:54:24] <wikibugs>	 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Papaul)
[16:54:53] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul)
[16:55:02] <wikibugs>	 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Papaul)
[16:55:04] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Papaul)
[16:55:06] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul) 05Open→03Resolved complete
[16:56:45] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BBlack) In a sample I just took across all recdns for a little over 15 minutes of sniffer time looking for requests to the legacy LVS-based recdns IPs: * ulsfo, eqsin, and esams had no traffic to them...
[16:57:14] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:01:12] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.664e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:02:10] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:02:28] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 448 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:03:22] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[17:03:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:04:17] <wikibugs>	 (03CR) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi)
[17:04:21] <wikibugs>	 (03PS1) 10Ssingh: Update tox.ini [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/555546
[17:06:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update tox.ini [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/555546 (owner: 10Ssingh)
[17:06:15] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[17:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:26] <icinga-wm>	 PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:07:32] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:08:00] <icinga-wm>	 PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:08:24] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:00] <bblack>	 did someone just re-enable those install boxes?
[17:10:17] <bblack>	 (i went to one to run the agent and find the message, and it just ran)
[17:10:55] <bblack>	 same for both actually, weird
[17:11:07] <bblack>	 The last Puppet run was at Thu Dec  5 20:17:37 UTC 2019 (1252 minutes ago)
[17:11:15] <bblack>	 but agent ran fien without a re-enable on the first try
[17:11:18] <bblack>	 *fine
[17:11:36] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:11:39] <bblack>	 anyways
[17:11:46] <wikibugs>	 (PS2) Ssingh: Update tox.ini [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/555546
[17:12:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[17:12:32] <icinga-wm>	 RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:12:38] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:13:10] <icinga-wm>	 RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:13:24] <wikibugs>	 (CR) Ssingh: [C: +2] Update tox.ini [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/555546 (owner: Ssingh)
[17:13:54] <wikibugs>	 (Merged) jenkins-bot: Update tox.ini [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/555546 (owner: Ssingh)
[17:17:04] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:18:05] <wikibugs>	 (PS1) CDanis: traffic drop: require minimum absolute rps [puppet] - https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039)
[17:18:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] traffic drop: require minimum absolute rps [puppet] - https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039) (owner: CDanis)
[17:19:57] <bblack>	 !log editing /e/n/i carefully with sed across the fleet via cumin, to correct legacy "dns-nameservers" line in older installs
[17:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:14] <wikibugs>	 (PS2) CDanis: traffic drop: require minimum absolute rps [puppet] - https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039)
[17:21:29] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[17:21:55] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:21:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:22:03] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:22:45] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:22:52] <James_F>	 brennen, thcipriani: How'd you feel about deploying the nominal wmf.8 unblocker now (to group0)?
[17:23:03] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:23:21] <wikibugs>	 (PS1) Dzahn: phabricator: limit mysql access for admins to production realm [puppet] - https://gerrit.wikimedia.org/r/555551
[17:23:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] phabricator: limit mysql access for admins to production realm [puppet] - https://gerrit.wikimedia.org/r/555551 (owner: Dzahn)
[17:24:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:24:13] <wikibugs>	 (PS2) Dzahn: phabricator: limit mysql access for admins to production realm [puppet] - https://gerrit.wikimedia.org/r/555551
[17:24:15] <thcipriani>	 James_F: seems like it should be safe and might give us some insights about wmf.8 going to group1. I'll defer to brennen on his stomach for deployment though :)
[17:24:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:24:33] * James_F grins.
[17:24:39] <James_F>	 I can do the deploy.
[17:25:07] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:25:22] <brennen>	 James_F: i feel like it's friday and my nerves are shot, but if you are willing to do the deploy it seems pretty low-risk.
[17:25:35] <James_F>	 Kk, let's do it.
[17:25:46] <wikibugs>	 (CR) BBlack: [C: +2] lvs recdns: switch DNS aliases to anycast [dns] - https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993) (owner: BBlack)
[17:25:49] <wikibugs>	 (PS3) BBlack: lvs recdns: switch DNS aliases to anycast [dns] - https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993)
[17:26:30] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[17:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:01] <wikibugs>	 (CR) Paladox: [C: +1] phabricator: limit mysql access for admins to production realm [puppet] - https://gerrit.wikimedia.org/r/555551 (owner: Dzahn)
[17:27:57] <bblack>	 damn I missed my chance earlier to grab T240000 for something cool, it flew by a few hours ago :P
[17:27:58] <stashbot>	 T240000: Config on the RequestContext may not be the same as the main config - https://phabricator.wikimedia.org/T240000
[17:28:37] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:28:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:32] <James_F>	 bblack: SRE got T234567, I think it's only fair that other teams get cool numbered tasks from time to time. ;-)
[17:29:33] <stashbot>	 T234567: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567
[17:30:18] <wikibugs>	 (PS3) Dzahn: phabricator: limit mysql access for admins to production realm [puppet] - https://gerrit.wikimedia.org/r/555551 (https://phabricator.wikimedia.org/T238425)
[17:30:21] <cdanis>	 whoa I didn't realize *I* got that task number until now, that's awesome
[17:30:34] * James_F bows before the all-mighty cdanis.
[17:30:57] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:31:34] <wikibugs>	 (CR) Dzahn: [C: +2] "noop https://puppet-compiler.wmflabs.org/compiler1002/19841/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/555551 (https://phabricator.wikimedia.org/T238425) (owner: Dzahn)
[17:32:11] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.695e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:33:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:33:03] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[17:33:17] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 550 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:34:03] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:37:57] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:41:55] <wikibugs>	 (PS1) Jhedden: install_server: ceph change partman profile [puppet] - https://gerrit.wikimedia.org/r/555552 (https://phabricator.wikimedia.org/T236290)
[17:43:43] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[17:43:53] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.8/includes/libs/rdbms/database/Database.php: T239877 Have Database::makeWhereFrom2d assume  is string-based (duration: 01m 11s)
[17:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:59] <stashbot>	 T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877
[17:47:04] <wikibugs>	 (CR) Jhedden: [C: +2] install_server: ceph change partman profile [puppet] - https://gerrit.wikimedia.org/r/555552 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[17:48:03] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:48:47] <wikibugs>	 (PS1) CDanis: six fives is a lot like five nines, if you think about it. [puppet] - https://gerrit.wikimedia.org/r/555555
[17:53:13] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.333 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[17:54:28] <bblack>	 !log install2002 - restart squid3 service
[17:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:33] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[17:55:07] <cdanis>	 James_F: this one, though, *was* on purpose
[17:55:57] <wikibugs>	 (Abandoned) CDanis: six fives is a lot like five nines, if you think about it. [puppet] - https://gerrit.wikimedia.org/r/555555 (owner: CDanis)
[17:58:21] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:02:19] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7458 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:03:03] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.242e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:04:31] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 241 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:05:15] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[18:05:31] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[18:05:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[18:06:39] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[18:06:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[18:08:35] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:12:23] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw12.*
[18:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:11] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.279 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:13:25] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[18:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:05] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[18:15:32] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:07] <wikibugs>	 Operations, Traffic, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (BBlack) Dug into the odd cases from `install2002` and `kraz` - the common pattern here is that there are some daemons in the world which both (a) parse `/etc/resolv.conf` for themselves because they u...
[18:16:19] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the wmf group for Danny Horn - https://phabricator.wikimedia.org/T239881 (DannyH) @colewhite I'll be using this to access Turnilo and Superset. Thanks for your help, I appreciate it!
[18:17:27] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:20:31] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:22:55] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:23:01] <wikibugs>	 (PS1) Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585)
[18:24:27] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[18:25:43] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:29:01] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:29:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:29:57] <wikibugs>	 Operations, Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (CDanis) As discussed with @Joe , increased the weights of the lower-weighted api_appservers in eqiad.    18:12 <cdanis@cumin2001> conftool action : set/weight=15; selector: service=ng...
[18:30:01] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:32:09] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9958 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:34:01] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[18:34:41] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[18:34:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:31] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:36:49] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:36:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:47] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:41:11] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.096 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:41:57] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7015 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:42:23] <wikibugs>	 (PS1) Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585)
[18:43:02] <wikibugs>	 (Abandoned) Dzahn: new profile/role for IRC server using charybdis (WIP) [puppet] - https://gerrit.wikimedia.org/r/345791 (https://phabricator.wikimedia.org/T134271) (owner: Dzahn)
[18:43:07] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 824 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:50:31] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.26e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:51:33] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 391 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:55:49] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:05] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=20; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw12.*
[18:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:56] <wikibugs>	 Operations, Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (CDanis) This shows the number of api_appservers in the pool which are close to maxing out on their php-fpm workers:  https://grafana.wikimedia.org/explore?orgId=1&left=%5B%22157550400...
[18:57:59] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:58:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:47] <wikibugs>	 Operations, Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (elukey)
[19:03:52] <wikibugs>	 Operations, Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (elukey) It seems that both tx and rx bandwidth gets saturated by SETs/GETs for the same key on mc1026:  `   Time           eth0 HH:MM:SS   Kbps in  Kbps out  17:41:34 680050.6 980400....
[19:07:05] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[19:07:15] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:07:27] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1007 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:07:55] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:19:18] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 9837 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:19:48] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.196 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:20:22] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 199 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:24:26] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.06667 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:29:42] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.262 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:35:30] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:38:36] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:40:23] <wikibugs>	 Operations, ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (Jgreen)
[19:43:03] <wikibugs>	 Operations, ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (Jgreen)
[19:43:07] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (Jgreen)
[19:43:30] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07917 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:44:24] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops: rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (Jgreen) Resolved→Open
[19:44:44] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops: rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (Jgreen) p:High→Normal
[19:45:17] <wikibugs>	 10Operations, 10ops-codfw: codfw: rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Jgreen)
[19:45:55] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw: rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Jgreen)
[19:46:48] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen)
[19:47:04] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen)
[20:08:02] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[20:11:32] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.03333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[20:16:00] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.929e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:19:14] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 907 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:24:38] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.311e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:26:36] <wikibugs>	 Operations, observability: Make grafana-next.wm.o HTTP 302 redirect to grafana.wm.o - https://phabricator.wikimedia.org/T240048 (CDanis)
[20:28:14] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 685 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:32:10] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8652 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:34:48] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8833 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[20:34:50] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[20:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:00] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 719 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:37:01] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:37:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:27] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.08333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[20:41:57] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.469e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:41:58] <wikibugs>	 Operations, Gerrit, Phabricator, Security-Team, Traffic: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (Dzahn)
[20:42:45] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[20:44:21] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 487 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:50:03] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.591e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:56:05] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 628 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:00:09] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.647e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:01:39] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 415 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:01:53] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[21:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:02] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:33] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.04e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:12:11] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 658 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:12:32] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1227
[21:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:36] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1222
[21:12:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:48] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1233
[21:12:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:52] <wikibugs>	 Operations, Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (elukey) Tried to get a lot of pcaps from tcpdump, to get host:timestamp combinations for mw apis with the goal of finding something interesting in their httpd access logs (for sv.wiki re...
[21:14:34] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=25; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw12[789].*
[21:14:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:33] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1233.eqiad.wmnet
[21:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:41] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1222.eqiad.wmnet
[21:15:45] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1227.eqiad.wmnet
[21:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:04] <logmsgbot>	 !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1231.eqiad.wmnet
[21:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:31] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.423e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:23:32] <wikibugs>	 (PS1) BBlack: Switch phab SPF back to phab1001 [dns] - https://gerrit.wikimedia.org/r/555611 (https://phabricator.wikimedia.org/T238956)
[21:23:44] <logmsgbot>	 !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[21:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:54] <logmsgbot>	 !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:41] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.377e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:28:59] <wikibugs>	 (CR) Dzahn: [C: +2] Switch phab SPF back to phab1001 [dns] - https://gerrit.wikimedia.org/r/555611 (https://phabricator.wikimedia.org/T238956) (owner: BBlack)
[21:33:29] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 783 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:37:33] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.958e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:39:03] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 694 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:41:40] <bblack>	 !log mc1026: adjusting rx ring to 2047 and disabling ethernet pause (will be a minor blip of eth link state!)
[21:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:33] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.08e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:48:07] <wikibugs>	 Operations, ops-codfw, fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (Dwisehaupt) bond0 interface set up and active.
[21:48:23] <wikibugs>	 Operations, ops-codfw, fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (Dwisehaupt)
[21:54:04] <bblack>	 !log mc1026: add tc-fq qdisc to eth0 for tx
[21:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:35] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[21:57:09] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.5 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[21:58:21] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (wiki_willy) Open→Resolved
[21:58:46] <bblack>	 cdanis: FYI/FTR, the 3 commands I've run to change things on 1026 are: ethtool -A eth0 autoneg off rx off tx off; ethtool -G eth0 rx 2047; tc qdisc add dev eth0 root fq
[21:58:51] <bblack>	 (it's the middle one that blips link)
[22:00:57] <bblack>	 reverting them would be, respectively: ethtool -A eth0 autoneg on rx on tx on; ethtool -G eth0 rx 200; tc qdisc del dev eth0 root
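A minimal sketch of the apply/revert pairs bblack lists above, kept together so the three mc1026 tweaks and their reverts stay in step. The commands are copied verbatim from the two messages; the names `TWEAKS`, `plan` and `run`, and the dry-run default, are assumptions for illustration and do not come from the log.

```python
import shlex
import subprocess

# (apply, revert) pairs, copied verbatim from the two messages above.
# The interface name eth0 and the ring sizes are as quoted there.
TWEAKS = [
    ("ethtool -A eth0 autoneg off rx off tx off",
     "ethtool -A eth0 autoneg on rx on tx on"),
    ("ethtool -G eth0 rx 2047",          # this is the one that blips the link
     "ethtool -G eth0 rx 200"),
    ("tc qdisc add dev eth0 root fq",
     "tc qdisc del dev eth0 root"),
]


def plan(revert=False):
    """Return the commands in the order they were listed in the log."""
    index = 1 if revert else 0
    return [pair[index] for pair in TWEAKS]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in commands:
        print(cmd)
        if not dry_run:
            subprocess.run(shlex.split(cmd), check=True)


if __name__ == "__main__":
    run(plan())              # dry run: show the three tweaks
    run(plan(revert=True))   # dry run: show the matching reverts
```

With the dry-run default nothing is changed on the host; passing `dry_run=False` would execute the commands and so requires root and the `ethtool` and `tc` binaries.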
[22:04:15] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[22:06:23] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 98 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:08:06] <bblack>	 !log mc1033: ethernet tweaks as well (expect a short link blip)
[22:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:10:30] <wikibugs>	 (PS1) Dwisehaupt: Adding new host frdb1003 to monitoring [puppet] - https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139)
[22:10:32] <wikibugs>	 (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139) (owner: Dwisehaupt)
[22:11:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:12:03] <wikibugs>	 (CR) Dzahn: [C: +2] Adding new host frdb1003 to monitoring [puppet] - https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139) (owner: Dwisehaupt)
[22:12:07] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.05833 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[22:15:30] <wikibugs>	 Operations, DC-Ops, serviceops: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (Dzahn) Open→Resolved a:Dzahn I don't know why, but the alert in Icinga has been clear for 14 days.
[22:16:10] <wikibugs>	 (CR) Dzahn: "deployed and ran puppet on icinga1001. Here you go.. pending checks at: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_stri" [puppet] - https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139) (owner: Dwisehaupt)
[22:16:32] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops, Patch-For-Review: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=frdb1003
[22:23:00] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[22:27:58] <wikibugs>	 Operations, ops-eqiad, fundraising-tech-ops, Patch-For-Review: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (Dwisehaupt)
[22:28:02] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[22:33:10] <wikibugs>	 Operations, Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (CDanis) @BBlack applied some NIC tweaks, which ultimately did not help:  22:08 <bblack> mc1033: ethernet tweaks as well (expect a short link blip) 21:54 <bblack> mc1026: add tc-fq qdi...
[22:40:48] <wikibugs>	 Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (JHedden) Open→Resolved
[22:41:24] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[22:46:18] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@c2bab5d]: Parsoid: Disable mirroring all traffic in split mode
[22:46:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:36] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.02083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[22:48:54] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.457e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:50:42] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 371 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:00:01] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@c2bab5d]: Parsoid: Disable mirroring all traffic in split mode (duration: 13m 43s)
[23:00:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:03] <wikibugs>	 (PS1) Bjornskjald: Update three logos with more detailed versions [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618)
[23:10:12] <wikibugs>	 (CR) Zoranzoki21: [C: -1] "You need to update InitialiseSettings.php also, as requested." [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[23:10:30] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[23:10:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:38] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 00m 07s)
[23:10:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:42] <wikibugs>	 (PS1) Ammarpad: Enable local uploads on inh.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925)
[23:11:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable local uploads on inh.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: Ammarpad)
[23:12:11] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[23:12:18] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 00m 07s)
[23:12:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:39] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[23:12:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:04] <wikibugs>	 (CR) Zoranzoki21: [C: -1] "Looks like it needs removal from dblists/commonsuploads.dblist also" [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: Ammarpad)
[23:17:40] <wikibugs>	 (CR) Bjornskjald: "I know, the task says I should do it in another patchset." [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[23:18:13] <wikibugs>	 (CR) Ebe123: [C: +1] "> Patch Set 1: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[23:19:31] <wikibugs>	 Operations, Traffic, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (colewhite) p:Triage→Normal
[23:19:47] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the wmf group for Danny Horn - https://phabricator.wikimedia.org/T239881 (colewhite) Open→Resolved
[23:21:57] <wikibugs>	 (CR) Ammarpad: "> Looks like it needs removal from dblists/commonsuploads.dblist also" [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: Ammarpad)
[23:23:30] <wikibugs>	 (CR) Zoranzoki21: "> > Patch Set 1: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[23:25:22] <wikibugs>	 (CR) Bjornskjald: "Hey, that's not an issue with your patch, but next time please either remove the logos you've done from the list, or add the Phabricator b" [mediawiki-config] - https://gerrit.wikimedia.org/r/554889 (owner: TechneSiyam)
[23:28:16] <wikibugs>	 (CR) Jforrester: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: Ammarpad)
[23:30:13] <wikibugs>	 (PS2) Bjornskjald: Update three logos with more detailed versions [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618)
[23:30:44] <wikibugs>	 (CR) Zoranzoki21: "> > Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: Ammarpad)
[23:31:33] <wikibugs>	 (CR) Jforrester: "> Patch Set 1: -Code-Review" [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: Ammarpad)
[23:37:28] <wikibugs>	 (PS1) Bstorm: toolforge-k8s: reduce the default terminated-pod-gc-threshold [puppet] - https://gerrit.wikimedia.org/r/555627 (https://phabricator.wikimedia.org/T240009)
[23:40:52] <wikibugs>	 (PS1) Bjornskjald: Add new HD logos to wgLogoHD [mediawiki-config] - https://gerrit.wikimedia.org/r/555629 (https://phabricator.wikimedia.org/T150618)
[23:50:03] <wikibugs>	 Operations, ContentSecurityPolicy, Gerrit, Phabricator, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (Bawolff)