[00:16:36] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet infrastructure for Debian buster - https://phabricator.wikimedia.org/T213546 (10Krenair) Ah, and that's against the parent task which makes sense. +1 for closing. [00:23:47] PROBLEM - puppet last run on archiva1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:50:15] RECOVERY - puppet last run on archiva1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:37:21] I am getting 503 time out issues when trying to import a 34MB xml file to Wikimaniawiki .... Request from (xxx.xxx.xxx.xxx) via cp5007 frontend, Varnish XID 787971469 Error: 503, Backend fetch failed at Sun, 14 Apr 2019 01:26:44 GMT [01:40:28] Repeatedly? Or just first try? [01:43:41] happened 12 hours ago (2x) [01:44:03] trying again now [01:45:54] Are you using the PHP7 beta feature? Might help... [01:45:55] (03PS1) 10Alex Monk: Fix broken profile::swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/503707 (https://phabricator.wikimedia.org/T220895) [01:46:45] nope, will try when this one fails [01:51:09] Reedy, same fail [01:51:23] with PHP7 [01:54:43] (03PS1) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 [01:55:11] (03CR) 10jerkins-bot: [V: 04-1] udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [01:56:39] (03PS2) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 [02:01:33] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [02:05:39] (03CR) 10Alex Monk: "Caused T220895" [puppet] - 10https://gerrit.wikimedia.org/r/371642 (owner: 10Filippo Giunchedi) [02:05:43] What're you importing? [02:05:47] Can you break it down to smaller chunks? 
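The "503, Backend fetch failed" above typically means the proxy gave up waiting while the large import was still running, which is why breaking the file into smaller pieces (as suggested at 02:05:47) tends to work. Below is a minimal sketch of one way to split a Special:Export dump by <page> blocks; it assumes a standard MediaWiki export file, the file name and chunk size are made up for the example, and the naive text split will misbehave if any page body contains a literal "</page>".

```python
#!/usr/bin/env python3
"""Rough sketch: split a large MediaWiki Special:Export dump into smaller
files so each import stays under the request timeout. Input name and chunk
size are illustrative only."""
import re

SOURCE = 'wikimania-export.xml'   # hypothetical input file
PAGES_PER_CHUNK = 200             # tune down until each chunk imports cleanly

with open(SOURCE, encoding='utf-8') as f:
    dump = f.read()

# Everything before the first <page> (XML declaration, <mediawiki ...>,
# <siteinfo>) is reused as the header of every chunk.
header = dump[:dump.index('<page>')]
footer = '</mediawiki>\n'

# One match per complete <page>...</page> block.
pages = re.findall(r'<page>.*?</page>', dump, flags=re.DOTALL)

for i in range(0, len(pages), PAGES_PER_CHUNK):
    name = f'chunk-{i // PAGES_PER_CHUNK:03d}.xml'
    with open(name, 'w', encoding='utf-8') as out:
        out.write(header)
        out.write('\n'.join(pages[i:i + PAGES_PER_CHUNK]))
        out.write('\n' + footer)
    print(name, min(PAGES_PER_CHUNK, len(pages) - i), 'pages')
```

Each resulting chunk can then be imported separately through Special:Import; alternatively, someone with shell access can run importDump.php server-side, which sidesteps the HTTP timeout entirely.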
[02:07:27] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:09:22] (03CR) 10Alex Monk: [C: 04-1] "./modules/profile/manifests/elasticsearch/cirrus.pp: file { '/etc/udev/rules.d/elasticsearch-readahead.rules':" [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [02:10:03] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:10:37] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:12:25] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 253784360 and 17 seconds [02:12:36] (03PS5) 10Krinkle: Remove pear packages from MW Application Servers [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [02:12:41] (03CR) 10Krinkle: [C: 03+1] Remove pear packages from MW Application Servers [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [02:14:35] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:15:25] (03PS3) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 [02:15:27] (03PS1) 10Alex Monk: profile::elasticsearch::cirrus: Don't duplicate udev stuff [puppet] - 10https://gerrit.wikimedia.org/r/503709 [02:17:11] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:17:39] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6488 and 24 seconds [02:20:34] (03CR) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [02:21:09] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:21:13] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:22:33] RECOVERY - 
recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:23:11] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:25:49] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:36:49] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19598296 and 0 seconds [02:44:35] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [02:49:31] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:49:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [02:50:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:00:19] (03PS1) 10Alex Monk: deployment-prep: Add stretch storage hosts [software/swift-ring] - 10https://gerrit.wikimedia.org/r/503714 [03:07:59] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [03:33:41] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz],File[/usr/share/GeoIP/GeoIPCity.dat.test] [03:34:17] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:35:09] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz] [03:44:35] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:00:43] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [04:01:37] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:03:15] (03CR) 10Alex Monk: "This is cleaning up after T95288 which I just stumbled into" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:05:25] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [04:07:51] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:13:01] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:14:21] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:15:59] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:17:17] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:28:24] (03PS3) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 [04:28:36] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:29:07] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:29:13] (03CR) 10jerkins-bot: [V: 04-1] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:30:46] (03PS4) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 [04:30:51] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:31:31] (03CR) 10jerkins-bot: [V: 04-1] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:31:43] (03CR) 10jerkins-bot: [V: 04-1] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:34:26] (03CR) 10Alex Monk: "Looks exactly as I'd expect: https://puppet-compiler.wmflabs.org/compiler1002/145/labnet1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:35:18] (03PS5) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 [04:43:09] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:44:31] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:53:28] (03PS1) 10EBernhardson: Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 [04:53:33] (03CR) 10EBernhardson: [C: 03+2] Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 (owner: 10EBernhardson) [04:54:36] (03Merged) 10jenkins-bot: Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 (owner: 10EBernhardson) [04:55:39] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [04:58:47] (03CR) 10jenkins-bot: Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 (owner: 10EBernhardson) [04:59:24] !log restart elasticsearch_6@production-searhc-psi-eqiad on elastic1027 due to 100% cpu for last 30+ minutes [04:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:07] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:23:29] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:31:13] !log ban elastic1027 from elasticsearch-psi in eqiad [05:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:32] !log unbanning elastic1027 after about half the shards left and load dropped [05:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:37] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [05:49:29] (03PS1) 10EBernhardson: shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 [05:50:34] (03CR) 10EBernhardson: [C: 03+2] shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 (owner: 10EBernhardson) [05:51:33] (03Merged) 10jenkins-bot: shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 (owner: 10EBernhardson) [05:54:44] (03CR) 10jenkins-bot: shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 (owner: 10EBernhardson) [05:55:15] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:58:03] (03PS1) 10EBernhardson: Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 [05:58:17] (03CR) 10EBernhardson: [C: 03+2] Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 (owner: 10EBernhardson) [05:59:17] (03Merged) 10jenkins-bot: Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 (owner: 10EBernhardson) [06:05:53] (03CR) 10jenkins-bot: Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 (owner: 10EBernhardson) [06:07:04] ebernhardson: anything I can do to help? [06:07:28] onimisionipe: i've switched enwiki to codfw, which is most of the traffic. Hoping that will resolve things for the weekend [06:08:36] Ok. That should settle things enough to investigate? [06:08:53] onimisionipe: i hope so at least :) I wont be arround tomorrow afternoon and i don't imagine many other people will either [06:09:35] Yea. I can imagine [06:10:36] !log unban elastic1027 from eqiad-psi [06:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:56] Any idea what could cause the latency issue? [06:12:00] onimisionipe: not sure yet, it's often been mysterious things in the past. Of the issues i can remember, one was aggressive disk readaheads, one was overloading the memory controllers with bad numa handling, one was an analysis problem that caused problems on zhwiki [06:12:06] would rather dig into it monday :) [06:12:16] Ebernhardson: can you help create a phab task? [06:12:39] sure [06:12:56] Ok. Cool. 
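For readers following the elastic1027 ban/unban logged at 05:31 and 06:10 above: the usual mechanism behind that kind of operation is Elasticsearch's allocation-exclusion setting, which drains shards off the named node so its load drops. A small sketch of the raw API call is below; the endpoint, port and node-name pattern are assumptions for illustration, and the operators almost certainly used their own tooling rather than a script like this.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a node "ban"/"unban" via Elasticsearch's
cluster-settings API. URL and node-name pattern are assumed, not taken
from the log above."""
import json
import urllib.request

CLUSTER = 'http://localhost:9200'   # assumed endpoint of the affected cluster
NODE_PATTERN = 'elastic1027*'       # wildcard also matches suffixed node names

def set_allocation_exclude(pattern):
    """PUT a transient allocation exclusion; pass None to clear it again."""
    body = json.dumps({
        'transient': {'cluster.routing.allocation.exclude._name': pattern}
    }).encode()
    req = urllib.request.Request(
        f'{CLUSTER}/_cluster/settings', data=body, method='PUT',
        headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# "ban": shards start migrating off the excluded node
print(set_allocation_exclude(NODE_PATTERN))
# later, once load has dropped, "unban" by clearing the setting (null value)
print(set_allocation_exclude(None))
```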
[06:15:39] 10Operations, 10Discovery-Search (Current work): Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) [06:16:43] Thanks [06:16:47] 10Operations, 10Discovery-Search (Current work): Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) A previous time this happened we added some new metrics endpoints inside elasticsearch and started logging them to prometheus to collect per-node latency metric... [06:18:25] onimisionipe: do you know, did we kill the wmf specialized prometheus-elasticsearch-exporter? [06:18:32] missing some important custom metrics we added to elasticsearch [06:18:44] oh nevermind, i see it in ps [06:19:09] but it isn't reporting my custom metrics :( [06:19:18] Ok [06:19:22] Hmmm [06:19:48] We recently upgraded that. I added custom metrics to upstream [06:20:01] And we pulled and built theirs [06:20:21] I'm not sure what you mean by custom metrics [06:20:29] it looks like the capture of metrics in prometheus-wmf-elasticsearch-exporter is wrapped in a try/except that throws away any error [06:21:15] onimisionipe: this one: https://github.com/wikimedia/puppet/blob/production/modules/prometheus/files/usr/local/bin/prometheus-wmf-elasticsearch-exporter#L39 [06:21:21] It's possible. But they should at least be logged [06:22:43] i see the metrics coming out of the elasticsearch endpoint it talks to, but when i fetch from the prometheus port nothing :( Anyways not important just this moment, can look more monday [06:23:09] Oh...that's your custom metric? [06:23:25] onimisionipe: yes, it comes from one of our elasticsearch plugins [06:23:32] helped track down problems before :) [06:24:13] I never really got that python file. I was talking about the justwatch exporter in golang [06:24:45] That's Prometheus-elasticsearch-exporter without wmf [06:25:46] actually i know the problem, next(n['latencies'] for _, n in nodes.iteritems() if n['name'] == hostname) [06:25:58] n['name'] now is something like elastic1034-production-search-eqiad [06:26:05] and hostname is, well, just a hostname [06:27:03] Ah... [06:27:14] That's why it's not reporting [06:27:38] Multi instance yet again..:) [06:28:32] (03PS1) 10EBernhardson: Repair elasticsearch per-node latency reporting [puppet] - 10https://gerrit.wikimedia.org/r/503718 (https://phabricator.wikimedia.org/T220901) [06:29:59] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:30:11] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) Patch does not fix overall problem, it fixes the per-node percentiles data collection which usually helps tracking down these kinds of pro... [06:30:52] codfw looks happy enough with the traffic, going to leave things alone for now [06:31:25] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [06:31:49] alright!
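To make the bug discussed at 06:25 easier to follow: with multi-instance clusters the node name returned by the stats API carries a cluster suffix (elastic1034-production-search-eqiad), so the strict comparison against the bare hostname never matches, next() raises StopIteration, and the blanket try/except swallows it, which is why the custom metrics vanished silently. The snippet below is a reconstruction of that failure plus one possible prefix-based repair; it is not the actual exporter code and not the Gerrit change 503718, just an illustration with made-up data.

```python
# Reconstruction of the matching problem, not the actual exporter code.
nodes = {
    'abc123': {'name': 'elastic1034-production-search-eqiad',
               'latencies': {'p95': 12.3}},
}
hostname = 'elastic1034'   # bare hostname, no cluster suffix

# Old-style strict match: never true once the cluster suffix was added,
# so next() raises StopIteration and the surrounding try/except hid it.
try:
    latencies = next(n['latencies'] for n in nodes.values()
                     if n['name'] == hostname)
except StopIteration:
    latencies = None
print(latencies)   # None -> metrics silently dropped

# One possible repair: accept either an exact match (single-instance) or a
# node name that starts with "<hostname>-" (multi-instance).
latencies = next((n['latencies'] for n in nodes.values()
                  if n['name'] == hostname
                  or n['name'].startswith(hostname + '-')), None)
print(latencies)   # {'p95': 12.3}
```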
[06:34:50] (03CR) 10Mathew.onipe: [C: 03+1] Repair elasticsearch per-node latency reporting [puppet] - 10https://gerrit.wikimedia.org/r/503718 (https://phabricator.wikimedia.org/T220901) (owner: 10EBernhardson) [06:52:14] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) I don't know it's necessarily related, but i noticed that full text qps is up in the last month. Over the last year we've been pretty cons... [06:56:21] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:57:49] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:04:48] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Nivas10798) [07:05:34] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Nivas10798) [07:06:04] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Nivas10798) [08:32:33] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10Patch-For-Review, 10User-Ladsgroup: Shortened URLs won't redirect when there's data - https://phabricator.wikimedia.org/T219986 (10Legoktm) 05Open→03Resolved a:03Ladsgroup [08:33:27] 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe) [08:37:15] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Yann) Can we move forward here? What are the blocking issues, if any? [09:44:56] (03PS16) 10Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) [10:38:51] PROBLEM - MD RAID on ms-be1013 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [10:38:53] ACKNOWLEDGEMENT - MD RAID on ms-be1013 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220907 [10:39:00] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10ops-monitoring-bot) [10:42:31] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 3 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdm3],Exec[xfs_label-/dev/sdm4],File[mountpoint-/srv/swift-storage/sdm3],File[mountpoint-/srv/swift-storage/sdm4] [10:42:37] PROBLEM - Check systemd state on ms-be1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:42:41] PROBLEM - Disk space on ms-be1013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdm3 is not accessible: Input/output error [11:02:03] RECOVERY - Disk space on ms-be1013 is OK: DISK OK [11:05:55] PROBLEM - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 3 failed LD(s) (Offline, Offline, Offline) [11:06:06] ACKNOWLEDGEMENT - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 3 failed LD(s) (Offline, Offline, Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220909 [11:06:10] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220909 (10ops-monitoring-bot) [11:11:52] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 162.26, 107.69, 55.58 https://wikitech.wikimedia.org/wiki/Swift [11:14:13] ACKNOWLEDGEMENT - MD RAID on ms-be1013 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220910 [11:14:21] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220910 (10ops-monitoring-bot) [11:16:38] PROBLEM - Disk space on ms-be1013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error [11:17:06] RECOVERY - very high load average likely xfs on ms-be1013 is OK: OK - load average: 9.18, 64.26, 54.46 https://wikitech.wikimedia.org/wiki/Swift [11:36:06] RECOVERY - Check systemd state on ms-be1013 is OK: OK - running: The system is fully operational [12:02:00] RECOVERY - Disk space on ms-be1013 is OK: DISK OK [13:28:25] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Urbanecm) Hi @Yann, we experienced some major problems with the script we use to create new wikis. We would like to know the root cause of tho... [14:11:24] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:11:48] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:22:23] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ladsgroup) >>! In T218155#5109699, @Urbanecm wrote: > Hi @Yann, we experienced some major problems with the script we use to create new wikis.... [14:30:44] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:31:10] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:35:42] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220910 (10Krenair) [14:35:43] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10Krenair) [14:37:23] 10Operations, 10Wikimedia-Mailing-lists: "You are doing that too often. Please try again later." 
during subscription a mailing list. - https://phabricator.wikimedia.org/T220914 (10jayantanth) [14:37:26] 10Operations: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (10Krenair) [14:50:22] (03PS1) 10Daimona Eaytoy: Add botadmin group on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503753 (https://phabricator.wikimedia.org/T220915) [15:55:31] 10Operations, 10Wikimedia-Mailing-lists: "You are doing that too often. Please try again later." during subscription a mailing list. - https://phabricator.wikimedia.org/T220914 (10Aklapper) 05Open→03Declined This is an intentional rate limiting. As it says, "Please try again later." [15:56:34] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:57:40] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 77928 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:03:59] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Aklapper) 05Open→03Stalled Hi @Nivas10798, thanks for taking the time to report this! Unfortunately this report lacks some information. If you have tim... [16:25:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: VMs on cloudvirt1015 crashing - https://phabricator.wikimedia.org/T220853 (10Andrew) I finished draining cloudvirt1015 and put it in downtime, so it's ready for whatever reboots/rebuilds/hardware changes might be needed. [16:31:00] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:48:44] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:48:58] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:42] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:15:10] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:15:24] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:20:37] (03PS4) 10Krinkle: profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) [20:51:10] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), and 2 others: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10Legoktm) @ladsgroup is there anything else left to do here? [20:53:31] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), and 2 others: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10Legoktm) Should we have a CDN purge when we create new short codes just in case the 404 was cached? [21:11:11] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), and 2 others: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10Krinkle) +1 for purging after creation. 
Varnish is configured to cache 4xx errors for a short time (5 min... [21:18:30] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:44:58] RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:54:30] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [22:56:39] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [22:57:06] uhhhhh [22:57:08] hmm [22:58:19] eh, hi [22:58:21] both esams [22:58:39] or are those the same machine ha [22:59:04] * volans here [22:59:24] depool ams if any doubt [22:59:27] seeing if i can get to mgmt of lvs3001 but seems not [22:59:42] 3002 should have taken over, checking [22:59:59] 3001 is up for me [23:00:28] wikipedia is still working for me :) [23:00:49] cdanis: right, and also uptime is 39 days [23:01:05] curl -v --resolve en.wikimedia.org:443:$(dig +short text-lb.esams.wikimedia.org) 'https://en.wikimedia.org/' [23:01:07] also looks fine for me [23:01:34] yep same [23:01:37] and it looks fine running it on icinga1001 [23:01:46] icinga1001 can manually ping 3001 as well [23:02:05] * volans correction: 3003 is the other of the pair [23:02:49] rescheduled host checks [23:03:30] why does icinga ping check think it's down but manual ping works.. eh [23:03:35] cannot ping ipv4 [23:03:40] 10.20.0.11 [23:03:47] true, replies from v6 [23:04:05] ipv4 HTTPS works fine though [23:04:43] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 83.51 ms [23:04:51] welp [23:05:11] hah figures as soon as I get to a computer [23:05:11] also all service checks on 3001 are green, it's only the host check itself [23:05:12] any possibly related network maintenance? [23:06:02] got recovery [23:06:18] there is definitely a lot of icmp packet loss for the v4 (but only ICMP I think) [23:06:22] icinga-wm: would you tell us here as well? [23:08:01] don't see anything matching in the maint-announce calendar or inbox so far [23:08:12] have to board a plane in a moment [23:08:47] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:08:54] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 83.35 ms [23:09:19] https://grafana.wikimedia.org/d/000000365/network-performances?orgId=1&from=now-1h&to=now [23:09:32] sorry, i have to board [23:12:10] Didn't they just do something about rerouting ICMP? [23:12:13] that graph kinda sorta looks like [23:12:17] a bad thing [23:14:00] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:15:18] I was thinking of https://phabricator.wikimedia.org/T190090 [23:16:15] which might not even apply in esams [23:17:04] it looks like it doesn't [23:20:22] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 83.32 ms [23:26:04] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:30:40] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 66%, RTA = 83.36 ms [23:30:43] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.32 ms [23:37:24] big spike of outbound traffic: https://librenms.wikimedia.org/graphs/to=1555284900/id=13522/type=port_bits/from=1555198500/ [23:37:57] ICMP on the network performances graph cdanis linked earlier has gone back down [23:39:35] yeah, that big plateau is not normal either
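A closing note on the esams diagnosis above: the telling signature was Icinga host checks (ICMP ping) flapping to DOWN while HTTPS to the same address kept answering, i.e. ICMP-only loss on the v4 path. A rough sketch of that manual comparison is below; it assumes an iputils-style ping and uses the target names only as placeholders, so treat it as an illustration of the check rather than anything the responders actually ran.

```python
#!/usr/bin/env python3
"""Sketch: compare ICMP reachability with a TCP connect to port 443, so an
ICMP-only loss shows up as "ping fails, TCP fine". Target and counts are
illustrative."""
import socket
import subprocess

TARGET = 'text-lb.esams.wikimedia.org'   # or the bare v4 address that fails, e.g. the 10.20.0.11 mentioned above

def icmp_ok(host, count=5):
    """True if the system ping (iputils) gets at least one IPv4 reply."""
    return subprocess.run(
        ['ping', '-4', '-c', str(count), '-W', '2', host],
        stdout=subprocess.DEVNULL).returncode == 0

def tcp_ok(host, port=443, timeout=5):
    """True if a TCP handshake to host:port completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (OSError, socket.timeout):
        return False

print(f'{TARGET}: icmp={icmp_ok(TARGET)} tcp443={tcp_ok(TARGET)}')
# icmp=False with tcp443=True is the pattern seen here: the host alerts as
# DOWN in Icinga (which pings) while the service itself keeps serving.
```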