[00:01:47] 10Operations, 10Pywikibot, 10cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (10Dzahn) The WMF nameservers _are_ shown in the output of the whois command though. [00:04:18] 10Operations, 10Pywikibot, 10cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (10Dzahn) Also the DNS servers for toolforge.org are: ` Name Server: NS0.OPENSTACK.EQIAD1.WIKIMEDIACLOUD.ORG Name Server: NS1.OPENSTACK.EQIAD1.WIKI... [00:13:48] !log Performing one-time expiration of ArcLamp files older than 40 days (normal retention is 45 days), to solve disk space issue until either Ganeti issue is solved or compressed logfile support is merged. [00:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:22] ok, cool. i saw about 10G left. thanks Dave [00:14:56] 27GB now free. [00:15:22] great! [00:21:15] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10dpifke) We were down to ~5 GB free before the daily expiration took place, and about ~10 GB immediately after. To ensure this won't be a problem over the weekend,... [00:29:33] RECOVERY - Disk space on webperf1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [01:05:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:07:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:43:21] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Dzahn) An additional disk with just 20G has been created but it took forever and it won't help with this situation. [01:52:33] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [02:34:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:38:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:41:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:54:57] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:07:11] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={rsyslog-info,rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId [03:07:11] =eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [03:10:55] (03CR) 10Krinkle: mediawiki: Create /etc/firejail/mediawiki.local (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609840 (owner: 10Legoktm) [03:14:35] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.7856 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [03:23:13] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.5307 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [03:30:43] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.4438 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [04:25:37] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:37:46] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:39:23] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:43:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:43:25] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:03:41] (03PS6) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [05:23:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:25:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:53:58] (03CR) 10DannyS712: [C: 03+1] Remove line saying ldaplist will be removed 30 August 2016 [puppet] - 10https://gerrit.wikimedia.org/r/613360 (owner: 10Reedy) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200718T0700) [08:24:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:25:45] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [08:26:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:27:31] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [08:34:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:36:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:51:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:04:26] 10Operations, 10Cloud-VPS, 10Comments, 10Community-IdeaLab, and 6 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10Zblace) [09:06:20] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10RhinosF1) Nothing to do with the SocialTools group of Extensions [09:07:16] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10RhinosF1) Neither do I think Phabricator is the place to discuss this. Maybe meta? [09:09:59] cdanis: ^ I wonder if that could even be closed as invalid/declinded. I'll let someone else triage further though. 
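The swagger_check_cxserver_cluster_eqiad availability alert above flaps several times over the day. When triaging one of these, the usual first question is which targets of that job Prometheus currently considers down; a minimal sketch of that check, assuming jq is installed and using a placeholder base URL (the real eqiad ops Prometheus endpoint will differ):

  # Hypothetical spot-check: list the targets behind the flapping swagger check job
  # and whether Prometheus sees them as up (1) or down (0) right now.
  # PROM is a placeholder, not the real endpoint.
  PROM='http://prometheus.example.eqiad.wmnet/ops'
  curl -sG "${PROM}/api/v1/query" \
    --data-urlencode 'query=up{job="swagger_check_cxserver_cluster_eqiad"}' \
    | jq -r '.data.result[] | "\(.metric.instance) up=\(.value[1])"'

The same query over a time range (api/v1/query_range) would show whether the flaps line up with the cxserver LVS timeouts reported around 08:25.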
[09:19:01] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10RhinosF1) @ZBlace: why is this tagged with Cloud-VPS and Service requests? It's one or... [09:27:23] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10Zblace) >>! In T258317#6316658, @RhinosF1 wrote: > Neither do I think Phabricator is t... [09:28:21] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10RhinosF1) > I do not see how this would be informed discussion on Meta if there is no... [09:28:57] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10Zblace) >>! In T258317#6316662, @RhinosF1 wrote: > @ZBlace: why is this tagged with Cl... [09:30:05] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10RhinosF1) >>! In T258317#6316666, @Zblace wrote: >>>! In T258317#6316662, @RhinosF1 wr... [09:30:51] 10Operations, 10Cloud-VPS, 10Community-IdeaLab, 10Design Research tools and infrastructure, and 4 others: LOOMIO decision making software (also for discussing and concensus) - https://phabricator.wikimedia.org/T258317 (10Zblace) >>! In T258317#6316665, @RhinosF1 wrote: >> I do not see how this would be inf... [09:32:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:33:01] Retagged that task with what he actually wants [09:33:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:00:21] RhinosF1: thanks, appreciate it a lot :) [10:15:03] paravoid: I imagine it'll be like his first task that sits their waiting forever and a day because there's only him that cares [10:15:46] "it is an urgency for WMF to have something like this." Does not apply in anyone's wildest dreams. [10:41:03] (03CR) 10HitomiAkane: [C: 03+1] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100) (owner: 10Tks4Fish) [12:11:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:14:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:12:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:14:49] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:57:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:59:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:57] Is anyone else getting slight slowness on enwp from eqiad? [14:05:10] 1859ms (mw1366.eqiad) on the Main Page [14:05:21] On https://en.wikipedia.org/wiki/Tukwila_International_Boulevard_station sorry [14:05:43] Seems to be happening on quite a lot of pages though that are long [14:06:20] 1030 via mw1333 on https://en.wikipedia.org/wiki/Antequerinae [14:07:45] I can't reploduced @ eqsin -> eqiad [14:08:31] what stats of X-CACHE? [14:17:46] rxy: I'm mobile so not sure. [14:17:55] ah... [14:18:03] Let me try icurl [14:20:18] I can't find the response time on that [14:22:01] https://usercontent.irccloud-cdn.com/file/scyXKybE/IMG_6062.PNG [14:22:13] rxy: If that helps ^ [14:22:57] hmmmm [14:23:12] uslfo? [14:24:08] rxy: good point, can someone see in access logs to make sure that app doesn't proxy and make sure it's actually ignoring eqiad? [14:24:26] I set a custom UA on that so I can give it in PM [14:24:39] 3xxx is ulsfo [14:24:45] Yeah I know [14:24:59] I see server: mw1407.eqiad.wmnet [14:25:09] is meant app server. [14:25:13] not a front [14:25:21] So I wonder why it's hitting ulsfo's cache [14:26:07] your location provides ulsfo 's IP address as a front end server [14:26:50] x-client-ip is me so the app isn't proxying it [14:26:51] If you're in the UK, why are you not hitting esams? [14:27:03] Reedy: I'm asking you [14:27:24] DNS req in your device -> DNS full resolver @ your ISP -> WMF DNS respound by geoIP [14:27:36] The geoip is listing you as the UK [14:27:41] hang on [14:27:43] no, you are [14:27:48] cp3050 and cp3060 [14:27:54] there's no app servers in esams [14:28:07] Reedy: I am in the UK. Would you like the full headers I'm seeing? [14:28:21] cpNNNN is varnish or ATS i guess [14:30:03] rxy: you're confused ;P [14:30:12] 4xxx is ulsfo, 3xxx is esams [14:30:16] wow [14:30:18] lol [14:30:35] scrap that idea then [14:30:47] Reedy: any idea on the response times? [14:31:28] 10Operations, 10Wikimedia-Mailing-lists: Reset admin password for WikimediaBR-l - https://phabricator.wikimedia.org/T258324 (10Chicocvenancio) [14:33:53] It doesn't feel slow to me [14:34:26] perhaps trouble with your wifi [14:35:23] https://tools.pingdom.com/#5cd896472ec00000 using UK tester. 
seems no problem [14:36:33] Reedy: it doesn't feel too bad now [14:36:44] * RhinosF1 blames it on a stupid fluke [14:37:35] there's nothing that obviously stands out on the grafana boards [14:39:36] Hmm [14:41:07] Reedy: 1089ms (mw1272.eqiad) on https://en.wikipedia.org/wiki/Centinela_Solar_Energy_Project [14:41:21] If I click random page until a long page comes I see it [14:41:32] It's normally about 200-300ms if that [14:48:33] 1089 is fine for me. [14:48:38] (ms) [14:50:33] 10Operations, 10Wikimedia-Mailing-lists: Reset admin password for WikimediaBR-l - https://phabricator.wikimedia.org/T258324 (10Chicocvenancio) 05Open→03Invalid Found password in old laptop backup. [15:25:17] RECOVERY - Check systemd state on testreduce1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:43:39] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:09:41] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:24:13] 10Operations, 10Dumps-Generation: Reboot snapshot hosts - https://phabricator.wikimedia.org/T255550 (10ArielGlenn) 05Open→03Resolved [20:27:17] PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100% [20:35:53] PROBLEM - MariaDB Replica IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:37:04] !log cdanis@cumin1001 dbctl commit (dc=all): 'depool db1082, it crashed', diff saved to https://phabricator.wikimedia.org/P11951 and previous config saved to /var/cache/conftool/dbconfig/20200718-203704-cdanis.json [20:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:23] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10CDanis) p:05Triage→03High [20:39:17] RECOVERY - Host db1082 is UP: PING WARNING - Packet loss = 33%, RTA = 0.22 ms [20:43:25] PROBLEM - MariaDB read only s5 on db1082 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [20:43:46] PROBLEM - MariaDB Replica SQL: s5 #page on db1082 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:44:16] PROBLEM - MariaDB Replica IO: s5 #page on db1082 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:44:46] <_joe_> uhm [20:44:48] PROBLEM - mysqld processes #page on db1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [20:44:51] here [20:44:53] <_joe_> just got the pages 
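Stepping back to the latency thread from earlier this afternoon: the spot-checks Reedy and RhinosF1 were doing by hand (reading which cp frontends and which mwNNNN appserver handled a request, plus the total request time) can be scripted with curl. A rough sketch, using one of the article URLs mentioned above and a placeholder User-Agent; per Reedy, cp4xxx is ulsfo and cp3xxx is esams:

  # Fetch a page, discard the body, and print the total request time plus the
  # response headers that identify the serving appserver ("server:") and the
  # edge cache path ("x-cache:"). The User-Agent string is a placeholder.
  url='https://en.wikipedia.org/wiki/Tukwila_International_Boulevard_station'
  curl -s -o /dev/null -D - \
    -A 'latency-spot-check/0.1 (placeholder contact)' \
    -w 'time_total=%{time_total}s\n' \
    "$url" | grep -iE '^(server|x-cache):|^time_total='

Repeating it a few times against long pages gives a rough sense of whether the slowness is reproducible or, as the conversation above concluded, a one-off.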
[20:44:53] rebooted itself? [20:45:02] <_joe_> I see it's already depooled [20:45:06] here [20:45:24] <_joe_> well done cdanis, but why didn't we paged earlier [20:45:27] PROBLEM - MariaDB Replica Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1138.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:45:52] root@db1082:~# uptime [20:45:53] 20:45:47 up 6 min, 1 user, load average: 0.00, 0.17, 0.13 [20:45:56] I am guessing BBU issues... [20:46:04] here [20:46:40] * volans|off here [20:46:48] here [20:46:50] anything for us to do immediately, given it's already depooled? [20:46:53] marostegui: how can I help you? [20:47:07] nah, let's disable notifications, create a task and we can get to it on monday [20:47:11] as cdanis already depooled it <3 [20:47:27] ack :) [20:47:44] can someone create the task? [20:47:49] I will downtime + disable notifications [20:48:03] should we ack the victorops alert [20:48:05] cdanis created T258336 already [20:48:06] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336 [20:48:07] does logstash work? [20:48:12] I see no logs there [20:48:39] <_joe_> no logs at all or no logs specifically? [20:48:50] nothing [20:49:07] oh thanks cdanis [20:49:09] <_joe_> yeah it's broken [20:49:10] since 2:30 [20:49:31] <_joe_> ok, that seems like a more serious problem than one database down [20:49:39] (03PS1) 10Marostegui: db1082: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614292 (https://phabricator.wikimedia.org/T258336) [20:49:40] <_joe_> can someone call up people from observability? [20:50:11] 10Operations, 10DBA, 10Patch-For-Review: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10Marostegui) Host rebooted itself: ` root@db1082:~# uptime 20:45:47 up 6 min, 1 user, load average: 0.00, 0.17, 0.13 ` Cause yet to be investigated with HW logs and so on [20:50:32] replied 73305 to victorops to ack it. in icinga the host is in downtime. [20:50:34] cdanis: thanks so much for depooling and creating the task! [20:50:40] mutante: yep, I downtimed it [20:50:50] I will ack db1124 replication too [20:50:52] thanks [20:51:08] this happened last time too [20:51:14] (03CR) 10Marostegui: [C: 03+2] db1082: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614292 (https://phabricator.wikimedia.org/T258336) (owner: 10Marostegui) [20:51:17] where it didn't page until it came back up automatically [20:51:29] marostegui, jynus: from iLO logs I can see two entries I'll paste them in the task [20:51:32] sigh, should have thought to downtime the host [20:51:48] cdanis: yeah, cause we don't page on host down only on mysqld processes, and hence... [20:51:51] right [20:51:55] ACKNOWLEDGEMENT - MariaDB Replica IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (110 Connection timed out) Jcrespo T258336 - The acknowledgement expires at: 2020-07-20 05:51:08. https://wikitech.wikimedia.org/wiki/MariaDB/tr [20:51:55] epooling_a_slave [20:51:55] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1470.15 seconds Jcrespo T258336 - The acknowledgement expires at: 2020-07-20 05:51:08. 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:52:01] thanks volans [20:52:36] 10Operations, 10DBA, 10Patch-For-Review: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10Volans) hpiLO related logs here below: ` hpiLO-> show /system1/log1/record17 status=0 status_tag=COMMAND COMPLETED Sat Jul 18 20:50:11 2020 /system1/log1/record17 Targets Properties number=1... [20:52:50] !log Due to db1082 crash there will be replication lag on s5 on labsdb hosts - T258336 [20:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:12] Ha, so it was BBU related.... [20:53:31] Not surprinsing...HP, old host to be refreshed on Q2....another day in the office [20:53:51] Anyways, to be followed up on Monday I think, I am going back to my sleep [20:53:59] Thanks everyone who responded so quickly, much appreciated <3 [20:55:00] anytime! :) [20:55:59] i sent an SMS to Keith about logstash [21:06:50] !log bounce logstash on logstash1009 [21:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:25] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.006271 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [21:10:59] !log bounce logstash on logstash1008 [21:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:09] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:14:26] !log bounce logstash on logstash1007 [21:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:06] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.002303 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [21:31:52] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [21:32:15] shdubsh: is what you are doing related with https://logstash.wikimedia.org/goto/474e3dcf162a253fa6a01ca0d45515e1 [21:32:46] we have stopped = [21:33:04] we have stopped receiving apache logs since 3 ap UTC [21:33:08] am`8 [21:33:10] grr [21:33:13] am* [21:33:29] effie: yep, that's the logging pipeline alright [21:34:13] thank you:) [21:37:02] shdubsh: so CPU high because garbage collection is running i guess. pid 14195 on logstash1019 is the one using it all and matches /var/log/logstash/logstash_jvm_gc.pid14195.log.0.current [21:39:23] mutante: heap graph on logstash100[789] definitely shows it pegged since 3am utc [21:40:56] heh, same happened in codfw near the same time. [21:41:16] !log restart logstash on logstash200[456] [21:41:19] exactly on the hour? [21:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:11] looks like 10-15mins before in codfw [21:44:02] correction, looks like both datacenters stopped ingesting logs at ~2:44 utc [21:44:41] searched Gerrit for any changes with logstash in the message. 
the last one was this change in mw core on Thursday but that is not deployed on mwdebug1001 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/599148 [21:47:31] shdubsh: at least 1009 is almost as high again as before now https://grafana.wikimedia.org/d/000000561/logstash?panelId=3&fullscreen&orgId=1&from=now-24h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-input=gelf%2F12201 [21:49:15] yeah, throughput is looking good. I'm a bit concerned that kafka consumer lag doesn't appear to be moving, though. [21:57:40] i see a lot of "event loop shut down". when i grep for that through all the logs it happened on 3 days: [21:57:54] 2020-07-01, 2020-07-08 and today [21:59:17] (ok, that was July-only, it happens roughly once a week) [21:59:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:08:43] when i keep reloading the mw apache2 log dashboard linked above by Effie i see the logs are slowly backfilling. for example the last one was at 2020-07-18T08:49:47 and next reload at 2020-07-18T09:17:23 now [22:12:11] kafka looks like its draining now [22:25:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:27:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:38:01] ingest looks pretty stable. I'll come back around and check on it later. [22:39:57] ack, cool. signing off [23:26:36] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 258.4 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
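As a postscript to the logstash investigation above, the two ad-hoc checks described there can be reproduced roughly as follows. The /var/log/logstash/ path and the GC log filename come from the discussion itself; the logstash-plain*.log* pattern is an assumption (Logstash's default log name), as is the Java 8-style 'Full GC' marker, so adjust both to what is actually present on the hosts:

  # Count "event loop shut down" occurrences per day across current and rotated logs
  # (the per-day tally mutante describes finding on 2020-07-01, 2020-07-08 and today).
  zgrep -ah 'event loop shut down' /var/log/logstash/logstash-plain*.log* \
    | grep -oh '^\[20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]' | tr -d '[' | sort | uniq -c

  # Rough confirmation that the high CPU was garbage collection: count full-GC events
  # in the GC log named at 21:37 (assumes a -XX:+PrintGCDetails style log format).
  grep -c 'Full GC' /var/log/logstash/logstash_jvm_gc.pid14195.log.0.current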