[00:16:36] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet infrastructure for Debian buster - https://phabricator.wikimedia.org/T213546 (10Krenair) Ah, and that's against the parent task which makes sense. +1 for closing. [00:23:47] PROBLEM - puppet last run on archiva1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:50:15] RECOVERY - puppet last run on archiva1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:37:21] I am getting 503 time out issues when trying to import a 34MB xml file to Wikimaniawiki .... Request from (xxx.xxx.xxx.xxx) via cp5007 frontend, Varnish XID 787971469 Error: 503, Backend fetch failed at Sun, 14 Apr 2019 01:26:44 GMT [01:40:28] Repeatedly? Or just first try? [01:43:41] happened 12 hours ago (2x) [01:44:03] trying again now [01:45:54] Are you using the PHP7 beta feature? Might help... [01:45:55] (03PS1) 10Alex Monk: Fix broken profile::swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/503707 (https://phabricator.wikimedia.org/T220895) [01:46:45] nope, will try when this one fails [01:51:09] Reedy, same fail [01:51:23] with PHP7 [01:54:43] (03PS1) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 [01:55:11] (03CR) 10jerkins-bot: [V: 04-1] udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [01:56:39] (03PS2) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 [02:01:33] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [02:05:39] (03CR) 10Alex Monk: "Caused T220895" [puppet] - 10https://gerrit.wikimedia.org/r/371642 (owner: 10Filippo Giunchedi) [02:05:43] What're you importing? [02:05:47] Can you break it down to smaller chunks? 
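The "503, Backend fetch failed" above typically means the proxy gave up waiting while the large import was still running, which is why breaking the file into smaller pieces (as suggested at 02:05:47) tends to work. Below is a minimal sketch of one way to split a Special:Export dump by <page> blocks; it assumes a standard MediaWiki export file, the file name and chunk size are made up for the example, and the naive text split will misbehave if any page body contains a literal "</page>".

```python
#!/usr/bin/env python3
"""Rough sketch: split a large MediaWiki Special:Export dump into smaller
files so each import stays under the request timeout. Input name and chunk
size are illustrative only."""
import re

SOURCE = 'wikimania-export.xml'   # hypothetical input file
PAGES_PER_CHUNK = 200             # tune down until each chunk imports cleanly

with open(SOURCE, encoding='utf-8') as f:
    dump = f.read()

# Everything before the first <page> (XML declaration, <mediawiki ...>,
# <siteinfo>) is reused as the header of every chunk.
header = dump[:dump.index('<page>')]
footer = '</mediawiki>\n'

# One match per complete <page>...</page> block.
pages = re.findall(r'<page>.*?</page>', dump, flags=re.DOTALL)

for i in range(0, len(pages), PAGES_PER_CHUNK):
    name = f'chunk-{i // PAGES_PER_CHUNK:03d}.xml'
    with open(name, 'w', encoding='utf-8') as out:
        out.write(header)
        out.write('\n'.join(pages[i:i + PAGES_PER_CHUNK]))
        out.write('\n' + footer)
    print(name, min(PAGES_PER_CHUNK, len(pages) - i), 'pages')
```

Each resulting chunk can then be imported separately through Special:Import; alternatively, someone with shell access can run importDump.php server-side, which sidesteps the HTTP timeout entirely.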
[02:07:27] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:09:22] (03CR) 10Alex Monk: [C: 04-1] "./modules/profile/manifests/elasticsearch/cirrus.pp: file { '/etc/udev/rules.d/elasticsearch-readahead.rules':" [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [02:10:03] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:10:37] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:12:25] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 253784360 and 17 seconds [02:12:36] (03PS5) 10Krinkle: Remove pear packages from MW Application Servers [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [02:12:41] (03CR) 10Krinkle: [C: 03+1] Remove pear packages from MW Application Servers [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [02:14:35] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:15:25] (03PS3) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 [02:15:27] (03PS1) 10Alex Monk: profile::elasticsearch::cirrus: Don't duplicate udev stuff [puppet] - 10https://gerrit.wikimedia.org/r/503709 [02:17:11] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:17:39] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6488 and 24 seconds [02:20:34] (03CR) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [02:21:09] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:21:13] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:22:33] RECOVERY - 
recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:23:11] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:25:49] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:36:49] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19598296 and 0 seconds [02:44:35] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [02:49:31] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:49:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [02:50:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:00:19] (03PS1) 10Alex Monk: deployment-prep: Add stretch storage hosts [software/swift-ring] - 10https://gerrit.wikimedia.org/r/503714 [03:07:59] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [03:33:41] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz],File[/usr/share/GeoIP/GeoIPCity.dat.test] [03:34:17] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:35:09] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz] [03:44:35] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:00:43] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [04:01:37] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:03:15] (03CR) 10Alex Monk: "This is cleaning up after T95288 which I just stumbled into" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:05:25] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [04:07:51] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:13:01] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:14:21] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:15:59] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:17:17] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:28:24] (03PS3) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 [04:28:36] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:29:07] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:29:13] (03CR) 10jerkins-bot: [V: 04-1] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:30:46] (03PS4) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 [04:30:51] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:31:31] (03CR) 10jerkins-bot: [V: 04-1] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:31:43] (03CR) 10jerkins-bot: [V: 04-1] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:34:26] (03CR) 10Alex Monk: "Looks exactly as I'd expect: https://puppet-compiler.wmflabs.org/compiler1002/145/labnet1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [04:35:18] (03PS5) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 [04:43:09] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:44:31] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:53:28] (03PS1) 10EBernhardson: Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 [04:53:33] (03CR) 10EBernhardson: [C: 03+2] Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 (owner: 10EBernhardson) [04:54:36] (03Merged) 10jenkins-bot: Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 (owner: 10EBernhardson) [04:55:39] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [04:58:47] (03CR) 10jenkins-bot: Revert "Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503715 (owner: 10EBernhardson) [04:59:24] !log restart elasticsearch_6@production-searhc-psi-eqiad on elastic1027 due to 100% cpu for last 30+ minutes [04:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:07] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:23:29] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:31:13] !log ban elastic1027 from elasticsearch-psi in eqiad [05:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:32] !log unbanning elastic1027 after about half the shards left and load dropped [05:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:37] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [05:49:29] (03PS1) 10EBernhardson: shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 [05:50:34] (03CR) 10EBernhardson: [C: 03+2] shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 (owner: 10EBernhardson) [05:51:33] (03Merged) 10jenkins-bot: shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 (owner: 10EBernhardson) [05:54:44] (03CR) 10jenkins-bot: shift wikidata and enwiki elasticsearch traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503716 (owner: 10EBernhardson) [05:55:15] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:58:03] (03PS1) 10EBernhardson: Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 [05:58:17] (03CR) 10EBernhardson: [C: 03+2] Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 (owner: 10EBernhardson) [05:59:17] (03Merged) 10jenkins-bot: Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 (owner: 10EBernhardson) [06:05:53] (03CR) 10jenkins-bot: Move wikidata elasticsearch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503717 (owner: 10EBernhardson) [06:07:04] ebernhardson: anything I can do to help? [06:07:28] onimisionipe: i've switched enwiki to codfw, which is most of the traffic. Hoping that will resolve things for the weekend [06:08:36] Ok. That should settle things enough to investigate? [06:08:53] onimisionipe: i hope so at least :) I wont be arround tomorrow afternoon and i don't imagine many other people will either [06:09:35] Yea. I can imagine [06:10:36] !log unban elastic1027 from eqiad-psi [06:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:56] Any idea what could cause the latency issue? [06:12:00] onimisionipe: not sure yet, it's often been mysterious things in the past. Of the issues i can remember, one was aggressive disk readaheads, one was overloading the memory controllers with bad numa handling, one was an analysis problem that caused problems on zhwiki [06:12:06] would rather dig into it monday :) [06:12:16] Ebernhardson: can you help create a phab task? [06:12:39] sure [06:12:56] Ok. Cool. 
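For readers following the elastic1027 ban/unban logged at 05:31 and 06:10 above: the usual mechanism behind that kind of operation is Elasticsearch's allocation-exclusion setting, which drains shards off the named node so its load drops. A small sketch of the raw API call is below; the endpoint, port and node-name pattern are assumptions for illustration, and the operators almost certainly used their own tooling rather than a script like this.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a node "ban"/"unban" via Elasticsearch's
cluster-settings API. URL and node-name pattern are assumed, not taken
from the log above."""
import json
import urllib.request

CLUSTER = 'http://localhost:9200'   # assumed endpoint of the affected cluster
NODE_PATTERN = 'elastic1027*'       # wildcard also matches suffixed node names

def set_allocation_exclude(pattern):
    """PUT a transient allocation exclusion; pass None to clear it again."""
    body = json.dumps({
        'transient': {'cluster.routing.allocation.exclude._name': pattern}
    }).encode()
    req = urllib.request.Request(
        f'{CLUSTER}/_cluster/settings', data=body, method='PUT',
        headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# "ban": shards start migrating off the excluded node
print(set_allocation_exclude(NODE_PATTERN))
# later, once load has dropped, "unban" by clearing the setting (null value)
print(set_allocation_exclude(None))
```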
[06:15:39] 10Operations, 10Discovery-Search (Current work): Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) [06:16:43] Thanks [06:16:47] 10Operations, 10Discovery-Search (Current work): Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) A previous time this happened we added some new metrics endpoints inside elasticsearch and started logging them to prometheus to collect per-node latency metric... [06:18:25] onimisionipe: do you know, did we kill the wmf specialized prometheus-elasticsearch-exporter? [06:18:32] missing some important custom metrics we added to elasticsearch [06:18:44] oh nevermind, i see it in ps [06:19:09] but it isn't reporting my custom metrics :( [06:19:18] Ok [06:19:22] Hmmm [06:19:48] We recently upgraded that. I added custom metrics to upstream [06:20:01] And we pulled and built theirs [06:20:21] I'm not sure what you mean by custom metrics [06:20:29] it looks like the capture of metrics in prometheus-wmf-elasticsearch-exporter is wrapped in a try/except that throws away any error [06:21:15] onimisionipe: this one: https://github.com/wikimedia/puppet/blob/production/modules/prometheus/files/usr/local/bin/prometheus-wmf-elasticsearch-exporter#L39 [06:21:21] It's possible. But they should at least be logged [06:22:43] i see the metrics coming out of the elasticsearch endpoint it talks to, but when i fetch from the prometheus port nothing :( Anyways not important just this moment, can look more monday [06:23:09] Oh...that's your custom metric? [06:23:25] onimisionipe: yes, it comes from one of our elasticsearch plugins [06:23:32] helped track down problems before :) [06:24:13] I never really got that python file. I was talking about the justwatch exporter in golang [06:24:45] That's Prometheus-elasticsearch-exporter without wmf [06:25:46] actually i know the problem, next(n['latencies'] for _, n in nodes.iteritems() if n['name'] == hostname) [06:25:58] n['name'] now is something like elastic1034-production-search-eqiad [06:26:05] and hostname is, well, just a hostname [06:27:03] Ah... [06:27:14] That's why it's not reporting [06:27:38] Multi instance yet again..:) [06:28:32] (03PS1) 10EBernhardson: Repair elasticsearch per-node latency reporting [puppet] - 10https://gerrit.wikimedia.org/r/503718 (https://phabricator.wikimedia.org/T220901) [06:29:59] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:30:11] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) Patch does not fix overall problem, it fixes the per-node percentiles data collection which usually helps tracking down these kinds of pro... [06:30:52] codfw looks happy enough with the traffic, going to leave things alone for now [06:31:25] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [06:31:49] alright!
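To make the bug discussed at 06:25 easier to follow: with multi-instance clusters the node name returned by the stats API carries a cluster suffix (elastic1034-production-search-eqiad), so the strict comparison against the bare hostname never matches, next() raises StopIteration, and the blanket try/except swallows it, which is why the custom metrics vanished silently. The snippet below is a reconstruction of that failure plus one possible prefix-based repair; it is not the actual exporter code and not the Gerrit change 503718, just an illustration with made-up data.

```python
# Reconstruction of the matching problem, not the actual exporter code.
nodes = {
    'abc123': {'name': 'elastic1034-production-search-eqiad',
               'latencies': {'p95': 12.3}},
}
hostname = 'elastic1034'   # bare hostname, no cluster suffix

# Old-style strict match: never true once the cluster suffix was added,
# so next() raises StopIteration and the surrounding try/except hid it.
try:
    latencies = next(n['latencies'] for n in nodes.values()
                     if n['name'] == hostname)
except StopIteration:
    latencies = None
print(latencies)   # None -> metrics silently dropped

# One possible repair: accept either an exact match (single-instance) or a
# node name that starts with "<hostname>-" (multi-instance).
latencies = next((n['latencies'] for n in nodes.values()
                  if n['name'] == hostname
                  or n['name'].startswith(hostname + '-')), None)
print(latencies)   # {'p95': 12.3}
```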
[06:34:50] (03CR) 10Mathew.onipe: [C: 03+1] Repair elasticsearch per-node latency reporting [puppet] - 10https://gerrit.wikimedia.org/r/503718 (https://phabricator.wikimedia.org/T220901) (owner: 10EBernhardson) [06:52:14] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) I don't know it's necessarily related, but i noticed that full text qps is up in the last month. Over the last year we've been pretty cons... [06:56:21] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:57:49] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:04:48] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Nivas10798) [07:05:34] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Nivas10798) [07:06:04] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Nivas10798) [08:32:33] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10Patch-For-Review, 10User-Ladsgroup: Shortened URLs won't redirect when there's data - https://phabricator.wikimedia.org/T219986 (10Legoktm) 05Open→03Resolved a:03Ladsgroup [08:33:27] 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe) [08:37:15] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Yann) Can we move forward here? What are the blocking issues, if any? [09:44:56] (03PS16) 10Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) [10:38:51] PROBLEM - MD RAID on ms-be1013 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [10:38:53] ACKNOWLEDGEMENT - MD RAID on ms-be1013 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220907 [10:39:00] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10ops-monitoring-bot) [10:42:31] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 3 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdm3],Exec[xfs_label-/dev/sdm4],File[mountpoint-/srv/swift-storage/sdm3],File[mountpoint-/srv/swift-storage/sdm4] [10:42:37] PROBLEM - Check systemd state on ms-be1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:42:41] PROBLEM - Disk space on ms-be1013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdm3 is not accessible: Input/output error [11:02:03] RECOVERY - Disk space on ms-be1013 is OK: DISK OK [11:05:55] PROBLEM - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 3 failed LD(s) (Offline, Offline, Offline) [11:06:06] ACKNOWLEDGEMENT - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 3 failed LD(s) (Offline, Offline, Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220909 [11:06:10] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220909 (10ops-monitoring-bot) [11:11:52] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 162.26, 107.69, 55.58 https://wikitech.wikimedia.org/wiki/Swift [11:14:13] ACKNOWLEDGEMENT - MD RAID on ms-be1013 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220910 [11:14:21] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220910 (10ops-monitoring-bot) [11:16:38] PROBLEM - Disk space on ms-be1013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error [11:17:06] RECOVERY - very high load average likely xfs on ms-be1013 is OK: OK - load average: 9.18, 64.26, 54.46 https://wikitech.wikimedia.org/wiki/Swift [11:36:06] RECOVERY - Check systemd state on ms-be1013 is OK: OK - running: The system is fully operational [12:02:00] RECOVERY - Disk space on ms-be1013 is OK: DISK OK [13:28:25] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Urbanecm) Hi @Yann, we experienced some major problems with the script we use to create new wikis. We would like to know the root cause of tho... [14:11:24] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:11:48] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:22:23] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ladsgroup) >>! In T218155#5109699, @Urbanecm wrote: > Hi @Yann, we experienced some major problems with the script we use to create new wikis.... [14:30:44] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:31:10] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:35:42] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220910 (10Krenair) [14:35:43] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10Krenair) [14:37:23] 10Operations, 10Wikimedia-Mailing-lists: "You are doing that too often. Please try again later." 
during subscription a mailing list. - https://phabricator.wikimedia.org/T220914 (10jayantanth) [14:37:26] 10Operations: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (10Krenair) [14:50:22] (03PS1) 10Daimona Eaytoy: Add botadmin group on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503753 (https://phabricator.wikimedia.org/T220915) [15:55:31] 10Operations, 10Wikimedia-Mailing-lists: "You are doing that too often. Please try again later." during subscription a mailing list. - https://phabricator.wikimedia.org/T220914 (10Aklapper) 05Open→03Declined This is an intentional rate limiting. As it says, "Please try again later." [15:56:34] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:57:40] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 77928 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:03:59] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10Aklapper) 05Open→03Stalled Hi @Nivas10798, thanks for taking the time to report this! Unfortunately this report lacks some information. If you have tim... [16:25:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: VMs on cloudvirt1015 crashing - https://phabricator.wikimedia.org/T220853 (10Andrew) I finished draining cloudvirt1015 and put it in downtime, so it's ready for whatever reboots/rebuilds/hardware changes might be needed. [16:31:00] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:48:44] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:48:58] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:42] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:15:10] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:15:24] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:20:37] (03PS4) 10Krinkle: profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) [20:51:10] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), and 2 others: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10Legoktm) @ladsgroup is there anything else left to do here? [20:53:31] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), and 2 others: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10Legoktm) Should we have a CDN purge when we create new short codes just in case the 404 was cached? [21:11:11] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), and 2 others: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10Krinkle) +1 for purging after creation. 
Varnish is configured to cache 4xx errors for a short time (5 min... [21:18:30] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:44:58] RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:54:30] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [22:56:39] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [22:57:06] uhhhhh [22:57:08] hmm [22:58:19] eh, hi [22:58:21] both esams [22:58:39] or are those the same machine ha [22:59:04] * volans here [22:59:24] depool ams if any doubt [22:59:27] seeing if i can get to mgmt of lvs3001 but seems not [22:59:42] 3002 should have taken over, checking [22:59:59] 3001 is up for me [23:00:28] wikipedia is still working for me :) [23:00:49] cdanis: right, and also uptime is 39 days [23:01:05] curl -v --resolve en.wikimedia.org:443:$(dig +short text-lb.esams.wikimedia.org) 'https://en.wikimedia.org/' [23:01:07] also looks fine for me [23:01:34] yep same [23:01:37] and it looks fine running it on icinga1001 [23:01:46] icinga1001 can manually ping 3001 as well [23:02:05] * volans correction: 3003 is the other of the pair [23:02:49] rescheduled host checks [23:03:30] why does icinga ping check think it's down but manual ping works.. eh [23:03:35] cannot ping ipv4 [23:03:40] 10.20.0.11 [23:03:47] true, replies from v6 [23:04:05] ipv4 HTTPS works fine though [23:04:43] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 83.51 ms [23:04:51] welp [23:05:11] hah figures as soon as I get to a computer [23:05:11] also all service checks on 3001 are green, it's only the host check itself [23:05:12] any possibly related network maintenance? [23:06:02] got recovery [23:06:18] there is definitely a lot of icmp packet loss for the v4 (but only ICMP I think) [23:06:22] icinga-wm: would you tell us here as well? [23:08:01] don't see anything matching in the maint-announce calendar or inbox so far [23:08:12] have to board a plane in a moment [23:08:47] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:08:54] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 83.35 ms [23:09:19] https://grafana.wikimedia.org/d/000000365/network-performances?orgId=1&from=now-1h&to=now [23:09:32] sorry, i have to board [23:12:10] Didn't they just do something about rerouting ICMP? [23:12:13] that graph kinda sorta looks like [23:12:17] a bad thing [23:14:00] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:15:18] I was thinking of https://phabricator.wikimedia.org/T190090 [23:16:15] which might not even apply in esams [23:17:04] it looks like it doesn't [23:20:22] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 83.32 ms [23:26:04] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:30:40] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 66%, RTA = 83.36 ms [23:30:43] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.32 ms [23:37:24] big spike of outbound traffic: https://librenms.wikimedia.org/graphs/to=1555284900/id=13522/type=port_bits/from=1555198500/ [23:37:57] ICMP on the network performances graph cdanis linked earlier has gone back down [23:39:35] yeah, that big plateau is not normal either
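A closing note on the esams diagnosis above: the telling signature was Icinga host checks (ICMP ping) flapping to DOWN while HTTPS to the same address kept answering, i.e. ICMP-only loss on the v4 path. A rough sketch of that manual comparison is below; it assumes an iputils-style ping and uses the target names only as placeholders, so treat it as an illustration of the check rather than anything the responders actually ran.

```python
#!/usr/bin/env python3
"""Sketch: compare ICMP reachability with a TCP connect to port 443, so an
ICMP-only loss shows up as "ping fails, TCP fine". Target and counts are
illustrative."""
import socket
import subprocess

TARGET = 'text-lb.esams.wikimedia.org'   # or the bare v4 address that fails, e.g. the 10.20.0.11 mentioned above

def icmp_ok(host, count=5):
    """True if the system ping (iputils) gets at least one IPv4 reply."""
    return subprocess.run(
        ['ping', '-4', '-c', str(count), '-W', '2', host],
        stdout=subprocess.DEVNULL).returncode == 0

def tcp_ok(host, port=443, timeout=5):
    """True if a TCP handshake to host:port completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (OSError, socket.timeout):
        return False

print(f'{TARGET}: icmp={icmp_ok(TARGET)} tcp443={tcp_ok(TARGET)}')
# icmp=False with tcp443=True is the pattern seen here: the host alerts as
# DOWN in Icinga (which pings) while the service itself keeps serving.
```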