[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191206T0000).
[00:00:04] urandom and Zoranzoki21: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:12] o/
[00:00:21] \o
[00:02:11] I have a request for whoever is handling SWAT: Prior to deploying r554910, I need an entry added to /srv/deployment/mediawiki/mediawiki/private/PrivateSettings.php on the deploy host. I need $wmgSessionStoreHMACKey set to something secret-y (long(ish) and random(ish))
[00:02:33] I don't have perms to edit it.
[00:03:34] RoanKattouw: can you SWAT?
[00:03:42] Sure
[00:03:53] And can you deploy my patch first, it no needs mwdebug?
[00:04:29] (CR) Catrope: [C: +2] Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476) (owner: Zoranzoki21)
[00:04:56] (PS3) Zoranzoki21: Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476)
[00:05:04] Oh, it needed rebase
[00:05:10] Can you reapply +2?
[00:05:16] (CR) Catrope: [C: +2] Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476) (owner: Zoranzoki21)
[00:06:08] (Merged) jenkins-bot: Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476) (owner: Zoranzoki21)
[00:08:38] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add *.archives.go.jp to $wgCopyUploadsDomains (T238476) (duration: 01m 00s)
[00:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:45] T238476: Add *.archives.go.jp to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T238476
[00:08:57] Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul)
[00:09:24] Ty so much RoanKattouw
[00:11:27] Operations, ops-codfw, DC-Ops, decommission: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (Papaul)
[00:14:15] RoanKattouw: are you going to have time to do the other patch? If so — did you see my earlier about PrivateSettings.php?
[00:14:38] Operations, Epic, Maps (Kartotherian), Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (Jdforrester-WMF) >>! In T216826#5640424, @MSantos wrote: > @Mathew.onipe and @Jdforrester-WMF just FYI: I have tested kartotherian with debian buster...
[00:14:45] RECOVERY - Old JVM GC check - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[00:15:10] (PS1) Papaul: DNS: Remove mgmt DNS for db2062 and db2067 [dns] - https://gerrit.wikimedia.org/r/554972
[00:16:56] (PS3) Papaul: DNS: Add mgmt and production DNS for frdb2002 [dns] - https://gerrit.wikimedia.org/r/554640
[00:17:29] (CR) Papaul: [C: +2] DNS: Add mgmt and production DNS for frdb2002 [dns] - https://gerrit.wikimedia.org/r/554640 (owner: Papaul)
[00:19:56] (PS2) Papaul: DNS: Remove mgmt DNS for db2062 and db2067 [dns] - https://gerrit.wikimedia.org/r/554972
[00:20:21] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for db2062 and db2067 [dns] - https://gerrit.wikimedia.org/r/554972 (owner: Papaul)
[00:21:21] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul)
[00:21:37] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Papaul)
[00:21:39] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul) Open→Resolved Complete
[00:22:13] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (Papaul)
[00:22:44] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (Papaul) Open→Resolved Complete
[00:22:46] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Papaul)
[00:23:19] urandom: Sorry, got distracted with something else, back
here now
[00:23:32] Yes I saw. How would you like me to generate that value?
[00:23:39] Operations, Epic, Maps (Kartotherian), Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (Mholloway) >>! In T216826#5640424, @MSantos wrote: > @Mathew.onipe and @Jdforrester-WMF just FYI: I have tested kartotherian with debian buster and u...
[00:24:05] Or maybe you could generate it yourself and put it in your homedir or another place on the deployment host where I can see it?
[00:25:01] (CR) Catrope: [C: +2] Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: Eevans)
[00:25:05] I don't think it matters a lot so long as it's reasonably unguessable
[00:25:09] RoanKattouw: date +%s | sha256sum | base64 | head -c 64 ; echo
[00:25:10] ?
[00:25:49] (Merged) jenkins-bot: Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: Eevans)
[00:25:50] Well, if I know what time it gets run.. Or around that time...
[00:26:00] Wouldn't take much brute forcing...
[00:26:02] OK that works, thank you
[00:26:36] Sorry for being lazy
[00:26:41] no worries
[00:26:55] 00:26:28 00:26:28 scap failed: RuntimeError Scap failed!: Call to mwscript eval.php stderr: Notice: Undefined variable: wmgSessionStoreHMACKey in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 474 (duration: 00m 00s)
[00:27:00] Beta is unhappy :P
[00:27:15] ls
[00:27:17] crap
[00:27:44] I guess a value needs to be added to deployment-deploy01.deployment-prep.eqiad.wmflabs too?
[00:27:45] But beta is easily fixed
[00:27:50] Yeah exactly
[00:28:30] I can read these files, but not write to them
[00:29:36] I'll fix beta
[00:30:51] ok, value staged
[00:30:58] jenkins next run will fix beta scap
[00:31:09] thanks!
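Editor's note on the key-generation exchange above: the objection at 00:25:50 is that a value derived from `date +%s` is only as unpredictable as the second it was run in, so anyone who knows the rough time has few candidates to try. The following is a minimal sketch, assuming Python is available on the host, of drawing the key from the operating system's random source instead. The function name `generate_hmac_key` and the 32-byte default are illustrative assumptions, not what was used in this deployment.

```python
import secrets


def generate_hmac_key(num_bytes: int = 32) -> str:
    """Return a hex-encoded key drawn from the OS random source.

    Hypothetical helper for illustration; the name and default size
    are assumptions, not taken from the log.
    """
    if num_bytes < 16:
        raise ValueError("use at least 16 bytes (128 bits) of randomness")
    # token_hex(n) returns 2*n hex characters backed by n random bytes.
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    print(generate_hmac_key())
```

For comparison, a timestamp seed known to within one hour leaves 3600 candidate inputs, whereas 32 random bytes leave 2^256.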
[00:36:30] Operations, ops-eqiad, DC-Ops, Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (Papaul) papaul@asw2-a-eqiad# show | compare [edit interfaces] - ge-8/0/8 { - description labstore1003; - }
[00:41:39] Oh whoops
[00:41:41] yes thanks Reedy
[00:42:16] OK I've pulled this onto mwdebug1001
[00:42:32] (CR) Reedy: ".gitattributes is in the next release!" [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester)
[00:42:42] RoanKattouw: ??
[00:43:30] urandom: The test server
[00:43:52] the key doesn't need to be on deploy1001?
[00:44:00] * urandom had been watching that file
[00:44:17] It presumably is
[00:44:29] But Roan has pulled it onto the test server for.. testing?
[00:44:39] Dunno if there's anything you can actually test though :D
[00:44:40] eevans@deploy1001:/srv/deployment/mediawiki/mediawiki/private$ grep -c wmgSessionStoreHMACKey PrivateSettings.php
[00:44:40] 0
[00:44:53] urandom: I put it in /srv/deployment/mediawiki-staging
[00:44:59] Which is where things get synced from
[00:45:06] It'll be in /srv/deployment when it's sync'd everywhere
[00:45:13] My session is still working on enwiki using the test server
[00:45:40] Not seeing any errors in logstash or when running eval.php, and if I dump out the kask config I see the right secret value
[00:45:42] So, let's roll
[00:46:02] RoanKattouw: this is only deployed to testwiki ATM, but it looks good there, too
[00:46:33] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[00:47:15] where is /srv/deployment/mediawiki-staging ?
[00:47:25] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Use PHP serialization with HMAC for Kask session serialization (T222099) (duration: 01m 01s)
[00:47:28] for my own edification
[00:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:32] T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099
[00:48:05] it's /srv/mediawiki-staging
[00:48:09] Oh it's /srv/mediawiki-staging, my apologies
[00:48:35] /srv/deployment/mediawiki/mediawiki is a symlink to /srv/mediawiki
[00:52:05] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7000 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:52:15] (CR) Jforrester: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester)
[00:53:20] (CR) Reedy: "You can still use git archive or similar which will follow gitattributes" [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester)
[00:53:53] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 10 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:57:28] RoanKattouw: LGTM; Thanks!
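Editor's note on the change synchronized at 00:47:25: the log describes it only as serialization "with HMAC" and does not show the code. As a generic illustration of what signing a serialized payload with a shared key involves, here is a minimal Python sketch. The function names `seal` and `unseal`, the `tag.payload` layout, and the choice of SHA-256 are all assumptions made for this example; this is not the RESTBagOStuff or Kask implementation.

```python
import hashlib
import hmac


def seal(payload: bytes, key: bytes) -> bytes:
    """Prefix the payload with a hex HMAC-SHA256 tag (hypothetical format)."""
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    return tag + b"." + payload


def unseal(blob: bytes, key: bytes) -> bytes:
    """Verify the tag and return the payload, or raise ValueError."""
    # A hex tag never contains ".", so the first "." is the separator
    # even when the payload itself contains dots.
    tag, sep, payload = blob.partition(b".")
    if not sep:
        raise ValueError("malformed blob: no separator")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: payload or key is wrong")
    return payload
```

The point of the tag is that a stored value which was altered, or written under a different key, is rejected before it is deserialized. This is also why every server needs the same key value before the code that reads it goes live.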
[00:58:01] FYI, I still don't see the value in /srv/deployment/mediawiki-staging/PrivateSettings.php
[00:58:28] reedy@deploy1001:/srv/deployment/mediawiki$ ls -al /srv/deployment/mediawiki-staging
[00:58:28] ls: cannot access '/srv/deployment/mediawiki-staging': No such file or directory
[00:58:33] remove the deployment
[00:58:43] yeah, sorry, mispaste
[00:58:59] I meant /srv/mediawiki/private
[00:59:24] eevans@deploy1001:/srv/mediawiki/private$ grep wmgSessionStoreHMACKey PrivateSettings.php
[00:59:24] eevans@deploy1001:/srv/mediawiki/private$
[01:00:07] The file hasn't been sync'd yet
[01:00:26] what syncs it?
[01:00:30] The deployer
[01:00:33] oh
[01:00:38] I thought that part was done
[01:00:46] RoanKattouw: Are you sync-file-ing PrivateSettings too? :P
[01:01:09] He's been idle for 15 minutes according to the server..
[01:01:36] his last was "So lets roll", which made me think we were rolling :)
[01:02:26] (PS4) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967
[01:02:28] (PS1) Jforrester: Variant configuration: Read and write variant config from conf-dir, not /tmp [mediawiki-config] - https://gerrit.wikimedia.org/r/554977
[01:02:30] (PS1) Jforrester: Stop setting wgSpamBlacklistEventLogging, no longer read [mediawiki-config] - https://gerrit.wikimedia.org/r/554978
[01:02:31] I'm slightly confused as he should've really sync'd that file before he did CommonSettings.php
[01:02:32] (PS1) Jforrester: Drop wgMediaInfoEnableOtherStatements and wgDepictsQualifierProperties, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/554979
[01:02:34] (PS1) Jforrester: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/554980
[01:02:40] Oh uhm
[01:02:41] I just shouted at RoanKattouw IRL.
[01:02:43] Yes I really should have, yikes
[01:02:44] !log reedy@deploy1001 Synchronized private/PrivateSettings.php: wmgSessionStoreHMACKey T222099 (duration: 01m 07s)
[01:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:50] T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099
[01:03:00] I just did it with the idle time :P
[01:03:37] I'm guessing there was a spam of errors in logstash relating to it being undefined.. Luckily only on testwiki so should've been minimal
[01:03:57] Oh lol I just did it and didn't hit the lock because yours finished right before mine started
[01:03:58] (PS2) Jforrester: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/554980
[01:04:09] !log catrope@deploy1001 Synchronized private/PrivateSettings.php: HMAC value for Kask config (T222099) (duration: 00m 59s)
[01:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:23] Sorry for being so scatterbrained today
[01:04:25] :)
[01:04:55] only 45K errors in logstash apparently :P
[01:04:59] (PS2) Jforrester: Stop setting wgSpamBlacklistEventLogging, no longer read [mediawiki-config] - https://gerrit.wikimedia.org/r/554978
[01:05:01] (PS2) Jforrester: Drop wgMediaInfoEnableOtherStatements and wgDepictsQualifierProperties, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/554979
[01:05:03] (PS3) Jforrester: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/554980
[01:05:05] (PS5) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967
[01:05:07] (PS2) Jforrester: Variant configuration: Read and write variant config from conf-dir, not /tmp [mediawiki-config] - https://gerrit.wikimedia.org/r/554977
[01:05:38] PROBLEM - Too many messages in kafka logging-eqiad on
icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource
[01:05:38] /ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:06:22] And they stopped when the file was sync'd
[01:06:25] So should be all good now
[01:06:29] \o/
[01:06:40] Oh, that isn't guarded
[01:07:00] So... yeah, it'll have been an undefined on every execution...
[01:08:17] Reedy, RoanKattouw: thanks for the help!
[01:16:21] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:18:40] Operations, ops-eqiad, serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (Papaul) @Jclark-ctr Please see below for available mgmt IP's that you can use for those servers. Once you have the asset tags please update the table with the...
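Editor's note on the ordering problem discussed between 01:00 and 01:07: CommonSettings.php was synchronized while the private file defining `$wmgSessionStoreHMACKey` had not yet been, so the variable was undefined on every execution. The manual `grep -c` at 00:44:40 is effectively the check that would have caught it. The following is a minimal sketch of the same check as a reusable function, assuming Python; `missing_settings` is a hypothetical helper and is not part of scap or any Wikimedia tooling.

```python
import re


def missing_settings(required, settings_source):
    """Return the names from `required` with no top-level assignment
    (`$name = ...`) in the given PHP settings text.

    Hypothetical pre-sync check for illustration only; it looks for a
    line that starts with the assignment and ignores `==` comparisons.
    """
    missing = []
    for name in required:
        pattern = r"^\s*\$" + re.escape(name) + r"\s*=(?!=)"
        if not re.search(pattern, settings_source, flags=re.MULTILINE):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Example text standing in for a private settings file.
    example = "<?php\n$wmgOtherSecret = 'x';\n"
    print(missing_settings(["wmgSessionStoreHMACKey"], example))
```

Run against the file that will actually be synchronized, a non-empty result means the defining file should be synchronized before the code that reads the variable.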
[01:25:25] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/SecurePoll/cli/dump.php: T239968 (duration: 01m 01s)
[01:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:31] T239968: `cli/dump.php` does not accept the --votes modifier - https://phabricator.wikimedia.org/T239968
[01:34:48] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/SecurePoll/cli/dump.php: T239968 (duration: 01m 00s)
[01:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:53] T239968: `cli/dump.php` does not accept the --votes modifier - https://phabricator.wikimedia.org/T239968
[01:52:16] (CR) Ebe123: [C: -1] Upload HD logos for aawiki, aawikibooks, aawiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[02:00:21] (CR) Zoranzoki21: "Wikis are closed, I will do this for another 3 projects." [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[02:12:30] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/SecurePoll/cli/dump.php: T239968 (duration: 01m 04s)
[02:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:36] T239968: `cli/dump.php` does not accept the --votes modifier - https://phabricator.wikimedia.org/T239968
[02:13:11] (PS4) Zoranzoki21: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618)
[02:13:59] (PS5) Zoranzoki21: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618)
[03:10:06] (CR) Andrew Bogott: [C: +2] Openstack codfw1dev: everything is ocata now [puppet] - https://gerrit.wikimedia.org/r/554842 (https://phabricator.wikimedia.org/T237749)
(owner: Andrew Bogott)
[03:34:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:37:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:53:52] !log herron@cumin1001 START - Cookbook sre.hosts.downtime
[03:53:55] !log herron@cumin1001 START - Cookbook sre.hosts.downtime
[03:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:57] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:09] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:45] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5388 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:12:07] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 18 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:24:44] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (contint1001, ...), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[04:36:12] (PS5) DannyS712: InitialiseSettings - clean up groupOverrides layout / spacing [mediawiki-config] -
https://gerrit.wikimedia.org/r/554392 (https://phabricator.wikimedia.org/T231178)
[04:43:10] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5348 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:44:50] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:12:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:13:22] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:56] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:44] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:31:08] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:14] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:08:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:00] (CR) Elukey: [V: +2 C: +2] secret: dummy credentials for airflow [labs/private] - https://gerrit.wikimedia.org/r/544993 (owner: EBernhardson)
[07:11:46] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:33] Operations, Wikimedia-IRC-RC-Server, Patch-For-Review: Replace ircd-ratbox with something newer/maintained -
https://phabricator.wikimedia.org/T134271 (elukey)
[07:12:54] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:13:06] Operations, Analytics, Code-Stewardship-Reviews, Tools, Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (elukey) >>! In T185319#5716701, @Dzahn wrote: > Is this really replacing the IRCd from T134271 ? Yep! Closed it as du...
[07:15:20] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:28] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:31] the eqiad - eqord link seems again under scheduled telia maintenance
[07:18:20] ah and the cr3-ulsfo is to eqord as well, telia maintenance for both links
[07:21:12] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 9595 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:22:57] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 8 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:26:36] lovely
[07:37:32] !log andrew@deploy1001 Started deploy [horizon/deploy@1ac26da]: (no justification provided)
[07:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:39] !log
andrew@deploy1001 Finished deploy [horizon/deploy@1ac26da]: (no justification provided) (duration: 00m 07s)
[07:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:15] !log installing libav security updates
[07:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:26] !log andrew@deploy1001 Started deploy [horizon/deploy@1ac26da]: (no justification provided)
[07:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:25] (CR) Urbanecm: [C: -1] "arbcom_fiwiki logos aren't optipng'ed. Could you run optipng -o7 on them, please?" [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[07:41:50] !log andrew@deploy1001 Finished deploy [horizon/deploy@1ac26da]: (no justification provided) (duration: 03m 23s)
[07:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:55] !log andrew@deploy1001 Started deploy [horizon/deploy@1ac26da]: (no justification provided)
[07:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:02] !log andrew@deploy1001 Finished deploy [horizon/deploy@1ac26da]: (no justification provided) (duration: 00m 08s)
[07:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:20] !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[07:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:31] !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 03m 11s)
[07:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:36] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.55 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors
https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[07:55:29] !log installing libonig security updates
[07:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:30] !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[07:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:00] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[08:01:33] !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 02m 03s)
[08:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:56] !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[08:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:24] !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 01m 28s)
[08:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:38] !log andrew@deploy1001 Started deploy [horizon/deploy@a8c759e]: (no justification provided)
[08:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:46] !log andrew@deploy1001 Finished deploy [horizon/deploy@a8c759e]: (no justification provided) (duration: 00m 07s)
[08:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:49] (PS1) Andrew Bogott: Horizon: add config files for 'train' [puppet] - https://gerrit.wikimedia.org/r/555030 (https://phabricator.wikimedia.org/T239974)
[08:20:51] (PS1) Andrew Bogott: Horizon: update some horizon settings for
Train [puppet] - https://gerrit.wikimedia.org/r/555031 (https://phabricator.wikimedia.org/T239974)
[08:20:53] (PS1) Andrew Bogott: codfw1dev: move to Horizon version 'train' [puppet] - https://gerrit.wikimedia.org/r/555032 (https://phabricator.wikimedia.org/T239974)
[08:21:43] (CR) Andrew Bogott: [C: +2] Horizon: add config files for 'train' [puppet] - https://gerrit.wikimedia.org/r/555030 (https://phabricator.wikimedia.org/T239974) (owner: Andrew Bogott)
[08:22:13] (CR) Andrew Bogott: [C: +2] Horizon: update some horizon settings for Train [puppet] - https://gerrit.wikimedia.org/r/555031 (https://phabricator.wikimedia.org/T239974) (owner: Andrew Bogott)
[08:22:27] (CR) Andrew Bogott: [C: +2] codfw1dev: move to Horizon version 'train' [puppet] - https://gerrit.wikimedia.org/r/555032 (https://phabricator.wikimedia.org/T239974) (owner: Andrew Bogott)
[08:24:46] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:30] !log installing libgd2 security updates on stretch
[08:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:52] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:30:20] (PS1) Andrew Bogott: Add openstack client packages for train [puppet] - https://gerrit.wikimedia.org/r/555080
[08:30:55] (CR) jerkins-bot: [V: -1] Add openstack client packages for train [puppet] - https://gerrit.wikimedia.org/r/555080 (owner: Andrew Bogott)
[08:34:24] (PS2) Andrew Bogott: Add openstack client packages for train [puppet] - https://gerrit.wikimedia.org/r/555080
[08:35:10] (CR) Andrew Bogott: [C: +2] Add openstack
client packages for train [puppet] - https://gerrit.wikimedia.org/r/555080 (owner: Andrew Bogott)
[08:36:35] !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[08:36:36] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:20] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:38:20] (CR) Ema: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/553369 (https://phabricator.wikimedia.org/T236017) (owner: Giuseppe Lavagetto)
[08:38:30] !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 01m 55s)
[08:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:41] (PS1) Andrew Bogott: Horizon: add 'train' versions of designate and neutron policy.json [puppet] - https://gerrit.wikimedia.org/r/555258 (https://phabricator.wikimedia.org/T239974)
[08:40:15] (CR) Andrew Bogott: [C: +2] Horizon: add 'train' versions of designate and neutron policy.json [puppet] - https://gerrit.wikimedia.org/r/555258 (https://phabricator.wikimedia.org/T239974) (owner: Andrew Bogott)
[08:41:32] !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided)
[08:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:31] !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 01m 59s)
[08:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:32] !log andrew@deploy1001 Started
deploy [horizon/deploy@1911591]: (no justification provided) [08:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:40] !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 00m 08s) [08:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:13] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [09:07:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (10Mathew.onipe) 05Open→03Resolved [09:08:44] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [09:20:45] 10Operations, 10Packaging, 10serviceops: Build and upload envoy 1.12.0 package. - https://phabricator.wikimedia.org/T237235 (10Joe) 05Open→03Resolved [09:20:48] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) [09:21:21] 10Operations, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10Joe) [09:23:22] (03PS1) 10Andrew Bogott: remove a dangling comma [labs/private] - 10https://gerrit.wikimedia.org/r/555276 [09:24:28] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] remove a dangling comma [labs/private] - 10https://gerrit.wikimedia.org/r/555276 (owner: 10Andrew Bogott) [09:26:04] (03CR) 10Giuseppe Lavagetto: blubberoid: break tls fucntionality into an helper (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554832 (owner: 10Giuseppe Lavagetto) [09:26:13] (03PS2) 10Giuseppe Lavagetto: blubberoid: break TLS functionality into a helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/554832 
(https://phabricator.wikimedia.org/T235411) [09:26:15] (03PS2) 10Giuseppe Lavagetto: scaffold: import the blubberoid tls helpers in scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411) [09:26:17] (03PS3) 10Giuseppe Lavagetto: eventgate: convert to use the common tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/554834 (https://phabricator.wikimedia.org/T235411) [09:37:02] 10Operations, 10Maps: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (10Mathew.onipe) [09:39:46] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:40:30] PROBLEM - Query Service HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:40:41] this is me ^ [09:41:06] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:42:36] (03PS3) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) [09:43:12] (03CR) 10jerkins-bot: [V: 04-1] Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: 10Muehlenhoff) [09:49:12] (03PS4) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) [09:54:00] RECOVERY - Blazegraph process -wdqs-blazegraph- 
on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:54:44] RECOVERY - Query Service HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:55:20] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:02:24] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:02:30] (03PS5) 10Hashar: contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) [10:02:52] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:03:23] onimisionipe: is it you with wdqs1010 ? [10:03:30] Or dcausse ? [10:03:34] PROBLEM - Query Service HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:04:37] gehel: it's me [10:04:46] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) As a way to identify more specifically where the TTFB regression comes from, in particular to understand precisely how much ats-be co... 
[10:05:06] I need to stop blazegraph to run some journal tools [10:05:56] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:06:24] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:06:34] dcausse: ok, I'll downtime [10:07:02] thanks [10:07:06] RECOVERY - Query Service HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:18:58] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Could you generate a separate stats table for misses and passthroughs? [10:24:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) (owner: 10Hashar) [10:25:08] 10Operations, 10Traffic, 10Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (10hashar) > pristine-tar: delta is version 3, newer than maximum supported version 2 @ema the CI debian-glue jobs are now running on Buster instances and thus come wit... [10:25:25] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5717652, @Gilles wrote: > Could you generate a separate stats table for misses and passthroughs? Certainly. Non-hits... 
[10:27:22] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthroughs (ATS/Varnish merely acted as a pro... [10:27:31] (03PS1) 10Elukey: Add fake keytab for analytics-search on stat1007 [labs/private] - 10https://gerrit.wikimedia.org/r/555360 [10:27:54] (03PS2) 10Giuseppe Lavagetto: prometheus::k8s: drop envoy metrics about the admin interface [puppet] - 10https://gerrit.wikimedia.org/r/553246 [10:27:56] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytab for analytics-search on stat1007 [labs/private] - 10https://gerrit.wikimedia.org/r/555360 (owner: 10Elukey) [10:36:08] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [10:37:56] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:38:36] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5717657, @Gilles wrote: > I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthrou... 
[10:41:51] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Yes, from a cache application perspective they are different tasks and therefore the issues affecting each could have different ca... [10:59:21] (03PS4) 10Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528 [11:00:42] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5292 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [11:02:26] (03PS5) 10Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528 [11:04:16] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [11:08:22] (03PS1) 10Ema: ATS: mark uncacheable responses as 'pass' in X-Cache-Int [puppet] - 10https://gerrit.wikimedia.org/r/555396 (https://phabricator.wikimedia.org/T227432) [11:08:47] (03CR) 10Elukey: "Sent another version, I realized that I was missing some stuff, the code was not correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/554528 (owner: 10Elukey) [11:10:06] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [11:11:54] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:21:30] (03PS6) 10Zoranzoki21: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) [11:21:49] (03CR) 10Zoranzoki21: "> arbcom_fiwiki logos aren't optipng'ed. 
Could you run optipng -o7 on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [11:46:52] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10elukey) p:05Triage→03High [11:48:53] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10elukey) [12:07:56] PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops [12:18:53] 10Operations, 10Release-Engineering-Team-TODO, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Add latest jenkins debian packages to apt.wikimedia.org and upgrade jenkins to latest LTS (2.190.3) - https://phabricator.wikimedia.org/T239586 (10hashar) Thanks for the new Jenkins packages :] Fo... [12:27:37] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [12:52:34] 10Operations, 10Epic, 10Maps (Kartotherian), 10Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10MSantos) >>! In T216826#5717180, @Jdforrester-WMF wrote: >>>! In T216826#5640424, @MSantos wrote: >> @Mathew.onipe and @Jdforrester-WMF just FYI: I h... [12:59:48] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) (owner: 10KartikMistry) [13:15:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Hmm those machines have 2 disks after all. I would swear I thought they had 4. 
Anyway, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554961 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [13:23:46] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170 (10BBlack) 05Open→03Resolved a:03BBlack I'm not sure how long it's been fixed in our infra, but it definitely works correctly now in our new... [13:31:19] !log starting transfer of blazegraph journal from wdqs1007 to stat1004 - T239898 [13:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:26] T239898: Investigate triple counts difference between dumps and what blazegraph reports - https://phabricator.wikimedia.org/T239898 [13:32:10] (03PS3) 10Ema: ATS: pass uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/553132 (https://phabricator.wikimedia.org/T238494) [13:34:54] (03CR) 10Ema: [C: 03+2] ATS: pass uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/553132 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [13:35:04] PROBLEM - traffic_server tls process restarted on cp3064 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3064&var-layer=tls [13:41:47] !log cp2004: adding do_global_ doesn't seem to work with reload, restart ats-be T238494 [13:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:53] T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [13:44:20] 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (10Jclark-ctr) closing due to duplicate . 
[13:52:01] 10Operations, 10Traffic: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498 (10BBlack) In these past couple of weeks we've had a real about-face on this issue, and I think there's a pretty strong consensus and rationale to pursue some kind of host-level caching, but there... [13:54:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] blubberoid: break TLS functionality into a helper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554832 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:00:49] 10Operations, 10Traffic: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BBlack) [14:01:11] 10Operations, 10Pybal, 10Traffic: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10BBlack) 05Open→03Declined These are still present AFAIK, and we're fairly certain it's just due to pybal healthchecks using blank/broken TCP connections to monitor them. That will be cleaned up... [14:01:37] (03Abandoned) 10Ema: ATS: mark uncacheable responses as 'pass' in X-Cache-Int [puppet] - 10https://gerrit.wikimedia.org/r/555396 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:02:46] 10Operations, 10Traffic: Make authdns-update compatible with local emergency changes - https://phabricator.wikimedia.org/T219400 (10BBlack) Sorry, I hadn't remembered we had this existing ticket. Will merge into the other newer one since it has patches already and some deeper context, and copy the main text over. 
[14:03:53] 10Operations, 10Traffic: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (10BBlack) [14:03:55] 10Operations, 10Traffic: Make authdns-update compatible with local emergency changes - https://phabricator.wikimedia.org/T219400 (10BBlack) [14:04:29] 10Operations, 10Traffic: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (10BBlack) Thoughts from the main text of the merged ticket: ------------ We should improve our current [1] support of deploying an emergency DNS change when other dependent services... [14:06:26] 10Operations, 10DNS, 10SRE-tools, 10Traffic: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (10BBlack) 05Open→03Declined Declined in favor of netbox integration ( T233183 ? ) making this problem go away. [14:08:28] 10Operations, 10DNS, 10Traffic, 10Core Platform Team Legacy (Watching / External), 10Services (watching): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818 (10BBlack) [14:09:04] 10Operations, 10DNS, 10Traffic, 10serviceops, and 2 others: icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818 (10BBlack) [14:09:13] (03CR) 10Herron: [C: 03+2] install_server: switch ganeti[345]* to raid1 layout [puppet] - 10https://gerrit.wikimedia.org/r/554961 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [14:11:34] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:11:45] 10Operations, 10DNS, 
10Traffic, 10serviceops, and 2 others: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818 (10BBlack) [14:12:49] !log cp3050: ats-backend-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/553132/ T238494 [14:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:55] T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:13:20] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:13:29] 10Operations, 10DNS, 10Traffic, 10serviceops, and 2 others: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818 (10BBlack) While we'll work on improvements that make this less-likely i... [14:16:09] 10Operations, 10DNS, 10Traffic: Consider DNSSec - https://phabricator.wikimedia.org/T26413 (10BBlack) Since we haven't updated this in two years, I figured I should post again: * DNSSEC is still awful * DNSSEC is still basically all the world has to solve certain problems, for better or worse. * DNSSEC has... 
[14:20:28] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:21:01] 10Operations, 10Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (10BBlack) [14:21:40] 10Operations, 10Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (10BBlack) a:05faidon→03None [14:22:16] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:23:12] 10Operations, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10BBlack) [14:23:15] 10Operations, 10Traffic: Implement GeoDNS smooth repooling in gdnsd - https://phabricator.wikimedia.org/T228678 (10BBlack) [14:23:18] 10Operations, 10Traffic: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525 (10BBlack) [14:23:30] 10Operations, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10BBlack) This is still something we want to pursue, but we really need to get past the smooth repooling issue first, so I've added that as a subtask (consider it blocking this one). 
[14:24:31] 10Operations, 10Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (10BBlack) [14:24:37] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) [14:27:49] 10Operations, 10Traffic: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) p:05Triage→03Normal [14:29:06] (03CR) 10Hashar: "From a discussion with Moritz: the resulting doxygen package will be used in a Docker container run by a CI Job. The resulting artifact " [debs/doxygen] (debian/buster-backports) - 10https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: 10Hashar) [14:29:26] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:31:14] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:37:46] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.167 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [14:38:20] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:40:06] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:40:11] !log text@esams: rolling ats-backend-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/553132/ T238494 [14:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:18] T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:47:10] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:48:56] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:52:50] (03CR) 10Jgreen: [C: 03+1] frack: fix asset tag management records [dns] - 10https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [14:55:30] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.021 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [14:56:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:57:48] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:04:24] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.025 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:04:56] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:06:42] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:12:09] Krinkle, AaronSchulz - o/ are you around? 
[15:13:16] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.008 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:16:21] (03PS1) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [15:18:36] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:19:03] (03PS2) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [15:21:48] (03PS4) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [15:22:48] (03CR) 10Muehlenhoff: "Great review, thanks! 
I've made a PS4, comments inline" (0326 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [15:23:56] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:24:26] (03PS3) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [15:24:30] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:25:01] (03CR) 10jerkins-bot: [V: 04-1] atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis) [15:25:31] 10Operations, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10Bstorm) It looks like the limit was last raised 5 years ago. I'll double check a couple things, but I suspect that's old stuff we can raise. [15:26:16] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:27:10] (03PS1) 10RLazarus: Refactor, preparatory to testing multiple hosts in parallel. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/555515 [15:27:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:28:31] (03CR) 10jerkins-bot: [V: 04-1] Refactor, preparatory to testing multiple hosts in parallel. [software/httpbb] - 10https://gerrit.wikimedia.org/r/555515 (owner: 10RLazarus) [15:28:55] (03CR) 10Muehlenhoff: [C: 03+1] "This version looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:29:50] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) a:05Papaul→03Jgreen @Jgreen all yours [15:32:50] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 2.108 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:33:24] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:35:10] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:35:39] (03PS2) 10RLazarus: Refactor, preparatory to testing multiple hosts in parallel. [software/httpbb] - 10https://gerrit.wikimedia.org/r/555515 [15:36:35] (03CR) 10Muehlenhoff: "A few comments inline. @Luca, let me know if you disagree with my comment on the krb* hosts." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:38:10] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:38:31] (03PS4) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [15:39:04] (03CR) 10jerkins-bot: [V: 04-1] atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis) [15:39:43] (03PS5) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [15:41:48] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8333 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:42:20] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:42:25] (03PS6) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [15:43:30] (03CR) 10CDanis: "PCC looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis) [15:45:52] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:46:30] these spikes are so excessively weird [15:46:53] rlazarus: it seems like el.ukey is onto the cause of these ones [15:47:14] yeah, reading [15:56:58] (03PS1) 10BBlack: lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993) [15:57:18] (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [15:57:48] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:58:45] gee thanks zone_validator :P [15:59:19] yeah...is anything down? 
[16:04:08] (03PS2) 10BBlack: lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993) [16:10:12] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:10:55] Vermont: hi! Are you reporting any issue? [16:12:11] elukey: a few minutes ago i couldn’t access WMF sites, but I could to other sites [16:12:15] but it isn’t an issue now [16:12:18] thx :) [16:12:34] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:12:34] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:13:17] Vermont: we are experiencing some latency issues with MediaWiki Api appservers, so something is ongoing but shouldn't affect all wikis that heavily [16:13:51] (03PS1) 10Jhedden: ceph: remove rook.io based ceph modules [puppet] - 10https://gerrit.wikimedia.org/r/555528 (https://phabricator.wikimedia.org/T236290) [16:14:16] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:17:42] (03CR) 10Jhedden: [C: 03+2] ceph: remove rook.io based ceph modules [puppet] - 
10https://gerrit.wikimedia.org/r/555528 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [16:17:54] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:19:06] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:21:50] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6390 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:22:40] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.225 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:23:14] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:23:36] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 185 https://wikitech.wikimedia.org/wiki/Memcached 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:23:46] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:23:52] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:25:30] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:25:36] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:28:41] elukey: was it alert related? [16:30:06] AaronSchulz: good morning :) [16:30:14] yes! 
https://phabricator.wikimedia.org/T239983 [16:30:40] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Papaul) [16:30:46] AaronSchulz: currently there is one single key that is causing troubles, the one for mc1026 [16:31:34] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5667 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:31:38] we are still not sure that this is the cause of the api latency spikes, but there is a correlation [16:32:08] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:32:51] <_joe_> !log flushing apcu on mw1339 [16:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul) ` papaul@asw2-a-eqiad# show | compare [edit interfaces] - ge-1/0/5 { - description db1069; - } [16:34:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul) [16:34:50] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:35:38] (03PS1) 10BBlack: lvs recdns decom [puppet] - 
10https://gerrit.wikimedia.org/r/555537 (https://phabricator.wikimedia.org/T239993) [16:35:40] (03PS7) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [16:35:42] (03PS1) 10BBlack: lvs recdns post-decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) [16:36:32] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:37:35] (03PS1) 10BBlack: lvs recdns: get rid of legacy recursor hostnames [dns] - 10https://gerrit.wikimedia.org/r/555539 (https://phabricator.wikimedia.org/T239993) [16:38:30] (03CR) 10CDanis: "Whoops, had forgotten the metric value. Fixed:" [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis) [16:39:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul) ` papaul@asw2-b-eqiad# show | compare [edit interfaces interface-range disabled] member xe-7/0/41 { ... } + member ge-2/0/19; [edi... 
[16:39:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul) [16:40:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul) [16:41:56] <_joe_> !log flush apcu across the api cluster in eqiad [16:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:16] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 3.737 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:42:50] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:44:10] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) 05Open→03Resolved [16:44:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul) ` papaul@asw2-b-eqiad# show | compare [edit interfaces interface-range disabled] member ge-2/0/19 { ... } + member ge-3/0/26; [edit interfaces] - ge-3/0/26... 
[16:45:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul) [16:45:55] 10Operations, 10Performance-Team, 10Traffic: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) We can now distinguish between hit, miss, and pass in text@esams ATS too. An important caveat when looking at these numbers is that Varnish supports hit-f... [16:47:32] <_joe_> !log apcu flush finished [16:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:38] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:48:38] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [16:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:48] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:07] (03PS1) 10Papaul: DNS: Remove mgmt DNS for radium,db1069,db1072 and db1073 [dns] - 10https://gerrit.wikimedia.org/r/555542 [16:51:58] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for radium,db1069,db1072 and db1073 [dns] - 10https://gerrit.wikimedia.org/r/555542 (owner: 10Papaul) [16:53:00] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:53:11] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Papaul) [16:53:28] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Papaul) 05Open→03Resolved complete [16:53:32] 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Papaul) [16:53:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul) [16:53:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Papaul) 05Open→03Resolved complete [16:53:54] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Papaul) [16:54:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul) [16:54:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Papaul) 05Open→03Resolved complete [16:54:24] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Papaul) [16:54:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul) [16:55:02] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Papaul) [16:55:04] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Papaul) [16:55:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Papaul) 05Open→03Resolved complete [16:56:45] 10Operations, 10Traffic, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BBlack) In a sample I just took across all recdns for a little over 15 minutes of sniffer time looking for requests to the legacy LVS-based recdns IPs: * ulsfo, eqsin, and esams had no traffic to them... [16:57:14] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:01:12] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.664e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:02:10] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:02:28] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 448 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:03:22] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 
handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:03:26] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:04:17] (03CR) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [17:04:21] (03PS1) 10Ssingh: Update tox.ini [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/555546 [17:06:05] (03CR) 10jerkins-bot: [V: 04-1] Update tox.ini [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/555546 (owner: 10Ssingh) [17:06:15] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [17:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:26] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:07:32] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:08:00] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:08:24] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:00] did someone just re-enable those install boxes? [17:10:17] (i went to one to run the agent and find the message, and it just ran) [17:10:55] same for both actually, weird [17:11:07] The last Puppet run was at Thu Dec 5 20:17:37 UTC 2019 (1252 minutes ago) [17:11:15] but agent ran fien without a re-enable on the first try [17:11:18] *fine [17:11:36] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:11:39] anyways [17:11:46] (03PS2) 10Ssingh: Update tox.ini [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/555546 [17:12:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:12:32] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:12:38] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:13:10] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:13:24] (03CR) 10Ssingh: [C: 03+2] Update tox.ini [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/555546 (owner: 10Ssingh) [17:13:54] (03Merged) 10jenkins-bot: Update tox.ini [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/555546 (owner: 10Ssingh) [17:17:04] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:18:05] (03PS1) 10CDanis: traffic drop: require minimum absolute rps [puppet] - 10https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039) [17:18:39] (03CR) 10jerkins-bot: [V: 04-1] traffic drop: require minimum absolute rps [puppet] - 10https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039) (owner: 10CDanis) [17:19:57] !log editing /e/n/i carefully with sed across the fleet via cumin, to correct legacy "dns-nameservers" line in older installs [17:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:14] (03PS2) 10CDanis: traffic drop: require minimum absolute rps [puppet] - 10https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039) [17:21:29] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on 
icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:21:55] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:21:57] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:22:03] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:22:45] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:22:52] brennen, thcipriani: How'd you feel about deploying the nominal wmf.8 unblocker now (to group0)? 
[17:23:03] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:23:21] (03PS1) 10Dzahn: phabricator: limit mysql access for admins to production realm [puppet] - 10https://gerrit.wikimedia.org/r/555551 [17:23:41] (03CR) 10jerkins-bot: [V: 04-1] phabricator: limit mysql access for admins to production realm [puppet] - 10https://gerrit.wikimedia.org/r/555551 (owner: 10Dzahn) [17:24:11] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:24:13] (03PS2) 10Dzahn: phabricator: limit mysql access for admins to production realm [puppet] - 10https://gerrit.wikimedia.org/r/555551 [17:24:15] James_F: seems like it should be safe and might give us some insights about wmf.8 going to group1. I'll defer to brennen on his stomach for deployment though :) [17:24:19] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:24:33] * James_F grins. [17:24:39] I can do the deploy. [17:25:07] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:25:22] James_F: i feel like it's friday and my nerves are shot, but if you are willing to do the deploy it seems pretty low-risk. [17:25:35] Kk, let's do it. 
[17:25:46] (03CR) 10BBlack: [C: 03+2] lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [17:25:49] (03PS3) 10BBlack: lvs recdns: switch DNS aliases to anycast [dns] - 10https://gerrit.wikimedia.org/r/555520 (https://phabricator.wikimedia.org/T239993) [17:26:30] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [17:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:01] (03CR) 10Paladox: [C: 03+1] phabricator: limit mysql access for admins to production realm [puppet] - 10https://gerrit.wikimedia.org/r/555551 (owner: 10Dzahn) [17:27:57] damn I missed my chance earlier to grab T240000 for something cool, it flew by a few hours ago :P [17:27:58] T240000: Config on the RequestContext may not be the same as the main config - https://phabricator.wikimedia.org/T240000 [17:28:37] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:32] bblack: SRE got T234567, I think it's only fair that other teams get cool numbered tasks from time to time. ;-) [17:29:33] T234567: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 [17:30:18] (03PS3) 10Dzahn: phabricator: limit mysql access for admins to production realm [puppet] - 10https://gerrit.wikimedia.org/r/555551 (https://phabricator.wikimedia.org/T238425) [17:30:21] whoa I didn't realize *I* got that task number until now, that's awesome [17:30:34] * James_F bows before the all-mighty cdanis. [17:30:57] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:31:34] (03CR) 10Dzahn: [C: 03+2] "noop https://puppet-compiler.wmflabs.org/compiler1002/19841/phab1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/555551 (https://phabricator.wikimedia.org/T238425) (owner: 10Dzahn) [17:32:11] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.695e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:33:01] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:33:03] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:33:17] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 550 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:34:03] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:37:57] RECOVERY - High average GET latency 
for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:41:55] (03PS1) 10Jhedden: install_server: ceph change partman profile [puppet] - 10https://gerrit.wikimedia.org/r/555552 (https://phabricator.wikimedia.org/T236290) [17:43:43] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:43:53] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.8/includes/libs/rdbms/database/Database.php: T239877 Have Database::makeWhereFrom2d assume is string-based (duration: 01m 11s) [17:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:59] T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 [17:47:04] (03CR) 10Jhedden: [C: 03+2] install_server: ceph change partman profile [puppet] - 10https://gerrit.wikimedia.org/r/555552 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [17:48:03] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:48:47] (03PS1) 10CDanis: six fives is a lot like five nines, if you think about it. [puppet] - 10https://gerrit.wikimedia.org/r/555555 [17:53:13] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.333 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:54:28] !log install2002 - restart squid3 service [17:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:33] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:55:07] James_F: this one, though, *was* on purpose [17:55:57] (03Abandoned) 10CDanis: six fives is a lot like five nines, if you think about it. [puppet] - 10https://gerrit.wikimedia.org/r/555555 (owner: 10CDanis) [17:58:21] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:02:19] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7458 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:03:03] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.242e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:04:31] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 241 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:05:15] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:05:31] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:05:35] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article 
title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:06:39] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:06:59] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:08:35] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:12:23] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw12.* [18:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:11] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.279 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:13:25] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [18:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:05] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:15:32] 
!log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:07] 10Operations, 10Traffic, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BBlack) Dug into the odd cases from `install2002` and `kraz` - the common pattern here is that there are some daemons in the world which both (a) parse `/etc/resolv.conf` for themselves because they u... [18:16:19] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Danny Horn - https://phabricator.wikimedia.org/T239881 (10DannyH) @colewhite I'll be using this to access Turnilo and Superset. Thanks for your help, I appreciate it! [18:17:27] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:20:31] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:22:55] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:23:01] (03PS1) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [18:24:27] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on 
icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:25:43] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:29:01] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:29:13] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:29:57] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10CDanis) As discussed with @Joe , increased the weights of the lower-weighted api_appservers in eqiad. 18:12 conftool action : set/weight=15; selector: service=ng... 
[18:30:01] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:32:09] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9958 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:34:01] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:34:41] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [18:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:31] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:36:49] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:47] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:41:11] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.096 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:41:57] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7015 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:23] (03PS1) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) [18:43:02] (03Abandoned) 10Dzahn: new profile/role for IRC server using charybdis (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/345791 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [18:43:07] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 824 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:50:31] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.26e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:51:33] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 391 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:55:49] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [18:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:05] !log cdanis@cumin2001 conftool action : set/weight=20; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw12.* [18:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:56] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10CDanis) This shows the number of api_appservers in the pool which are close to maxing out on their php-fpm workers: https://grafana.wikimedia.org/explore?orgId=1&left=%5B%22157550400... [18:57:59] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:47] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10elukey) [19:03:52] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10elukey) It seems that both tx and rx bandwidth gets saturated by SETs/GETs for the same key on mc1026: ` Time eth0 HH:MM:SS Kbps in Kbps out 17:41:34 680050.6 980400.... 
[19:07:05] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [19:07:15] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:07:27] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1007 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:07:55] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:18] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 9837 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:19:48] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.196 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:20:22] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 199 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:24:26] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.06667 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:29:42] PROBLEM - Logstash Elasticsearch indexing errors on 
icinga1001 is CRITICAL: 1.262 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:35:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:38:36] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:40:23] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:43:03] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:43:07] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [19:43:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07917 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:44:24] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) 05Resolved→03Open [19:44:44] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) p:05High→03Normal [19:45:17] 10Operations, 10ops-codfw: codfw: rack/setup/install frdb2002 - 
https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:45:55] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw: rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:46:48] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:47:04] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [20:08:02] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:11:32] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.03333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:16:00] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.929e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:19:14] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 907 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:24:38] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.311e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:26:36] 10Operations, 10observability: Make grafana-next.wm.o HTTP 302 redirect to 
grafana.wm.o - https://phabricator.wikimedia.org/T240048 (10CDanis) [20:28:14] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 685 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:32:10] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8652 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:34:48] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8833 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:34:50] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [20:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:00] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 719 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:37:01] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:27] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.08333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:41:57] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.469e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:41:58] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team, 10Traffic: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10Dzahn) [20:42:45] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:44:21] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 487 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:50:03] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.591e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:56:05] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 628 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:00:09] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.647e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:01:39] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 415 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:01:53] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime 
[21:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:02] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:33] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.04e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:12:11] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 658 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:12:32] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1227 [21:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:36] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1222 [21:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:48] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1233 [21:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:52] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10elukey) Tried to get a lot pcaps from tcpdump, to get host:timestamp combinations for mw apis with the goal of finding something interesting in their httpd access logs (for sv.wiki re... 
[21:14:34] !log cdanis@cumin2001 conftool action : set/weight=25; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw12[789].* [21:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:33] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1233.eqiad.wmnet [21:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:41] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1222.eqiad.wmnet [21:15:45] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1227.eqiad.wmnet [21:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:04] !log cdanis@cumin2001 conftool action : set/weight=15; selector: service=nginx,cluster=api_appserver,dc=eqiad,name=mw1231.eqiad.wmnet [21:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:31] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.423e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:23:32] (03PS1) 10BBlack: Switch phab SPF back to phab1001 [dns] - 10https://gerrit.wikimedia.org/r/555611 (https://phabricator.wikimedia.org/T238956) [21:23:44] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [21:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:54] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:41] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.377e+04 gt 5000 
https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:28:59] (03CR) 10Dzahn: [C: 03+2] Switch phab SPF back to phab1001 [dns] - 10https://gerrit.wikimedia.org/r/555611 (https://phabricator.wikimedia.org/T238956) (owner: 10BBlack) [21:33:29] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 783 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:37:33] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.958e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:39:03] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 694 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:41:40] !log mc1026: adjusting rx ring to 2047 and disabling ethernet pause (will be a minor blip of eth link state!) [21:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:33] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.08e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:48:07] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Dwisehaupt) bond0 interface set up and active. 
[21:48:23] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Dwisehaupt) [21:54:04] !log mc1026: add tc-fq qdisc to eth0 for tx [21:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:35] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [21:57:09] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.5 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [21:58:21] 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (10wiki_willy) 05Open→03Resolved [21:58:46] cdanis: FYI/FTR, the 3 commands I've run to change things on 1026 are: ethtool -A eth0 autoneg off rx off tx off; ethtool -G eth0 rx 2047; tc qdisc add dev eth0 root fq [21:58:51] (it's the middle one that blips link) [22:00:57] reverting them would be, respectively: ethtool -A eth0 autoneg on rx on tx on; ethtool -G eth0 rx 200; tc qdisc del dev eth0 root [22:04:15] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:06:23] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 98 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:08:06] !log mc1033: ethernet tweaks as 
well (expect a short link blip) [22:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:10:30] (03PS1) 10Dwisehaupt: Adding new host frdb1003 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139) [22:10:32] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139) (owner: 10Dwisehaupt) [22:11:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:12:03] (03CR) 10Dzahn: [C: 03+2] Adding new host frdb1003 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139) (owner: 10Dwisehaupt) [22:12:07] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.05833 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:15:30] 10Operations, 10DC-Ops, 10serviceops: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (10Dzahn) 05Open→03Resolved a:03Dzahn I don't know why but the alert in Icinga has cleared since 14 days. [22:16:10] (03CR) 10Dzahn: "deployed and ran puppet on icinga1001. Here you go.. 
pending checks at: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_stri" [puppet] - 10https://gerrit.wikimedia.org/r/555614 (https://phabricator.wikimedia.org/T239139) (owner: 10Dwisehaupt) [22:16:32] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=frdb1003 [22:23:00] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:27:58] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Dwisehaupt) [22:28:02] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:33:10] 10Operations, 10Performance-Team: Regression in mcrouter TKO/timeouts registered - https://phabricator.wikimedia.org/T239983 (10CDanis) @BBlack applied some NIC tweaks, which ultimately did not help: 22:08 mc1033: ethernet tweaks as well (expect a short link blip) 21:54 mc1026: add tc-fq qdi... 
[22:40:48] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10JHedden) 05Open→03Resolved [22:41:24] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:46:18] !log ppchelko@deploy1001 Started deploy [restbase/deploy@c2bab5d]: Parsoid: Disable mirroring all traffic in split mode [22:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:36] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.02083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:48:54] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.457e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:50:42] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 371 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:00:01] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@c2bab5d]: Parsoid: Disable mirroring all traffic in split mode (duration: 13m 43s) [23:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:03] (03PS1) 10Bjornskjald: Update three logos with more detailed versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) [23:10:12] (03CR) 10Zoranzoki21: [C: 04-1] "You 
need to update InitialiseSettings.php also, as requested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: 10Bjornskjald) [23:10:30] !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided) [23:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:38] !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 00m 07s) [23:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:42] (03PS1) 10Ammarpad: Enable local uploads on inh.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) [23:11:41] (03CR) 10jerkins-bot: [V: 04-1] Enable local uploads on inh.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: 10Ammarpad) [23:12:11] !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided) [23:12:18] !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 00m 07s) [23:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:39] !log andrew@deploy1001 Started deploy [horizon/deploy@1911591]: (no justification provided) [23:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:04] (03CR) 10Zoranzoki21: [C: 04-1] "Looks like it needs removal from dblists/commonsuploads.dblist also" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: 10Ammarpad) [23:17:40] (03CR) 10Bjornskjald: "I know, the task says I should do it in another patchset." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: 10Bjornskjald) [23:18:13] (03CR) 10Ebe123: [C: 03+1] "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: 10Bjornskjald) [23:19:31] 10Operations, 10Traffic, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10colewhite) p:05Triage→03Normal [23:19:47] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Danny Horn - https://phabricator.wikimedia.org/T239881 (10colewhite) 05Open→03Resolved [23:21:57] (03CR) 10Ammarpad: "> Looks like it needs removal from dblists/commonsuploads.dblist also" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: 10Ammarpad) [23:23:30] (03CR) 10Zoranzoki21: "> > Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: 10Bjornskjald) [23:25:22] (03CR) 10Bjornskjald: "Hey, that's not an issue with your patch, but next time please either remove the logos you've done from the list, or add the Phabricator b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554889 (owner: 10TechneSiyam) [23:28:16] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: 10Ammarpad) [23:30:13] (03PS2) 10Bjornskjald: Update three logos with more detailed versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) [23:30:44] (03CR) 10Zoranzoki21: "> > Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: 10Ammarpad) [23:31:33] (03CR) 10Jforrester: "> Patch Set 1: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555621 
(https://phabricator.wikimedia.org/T239925) (owner: 10Ammarpad) [23:37:28] (03PS1) 10Bstorm: toolforge-k8s: reduce the default terminated-pod-gc-threshold [puppet] - 10https://gerrit.wikimedia.org/r/555627 (https://phabricator.wikimedia.org/T240009) [23:40:52] (03PS1) 10Bjornskjald: Add new HD logos to wgLogoHD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555629 (https://phabricator.wikimedia.org/T150618) [23:50:03] 10Operations, 10ContentSecurityPolicy, 10Gerrit, 10Phabricator, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10Bawolff)