[01:00:20] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:12] RECOVERY - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_parser_cache_purging https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:52:26] (03CR) 10Andrew Bogott: [C: 03+2] glance image_sync: use primary_glance_image_store to choose the image store [puppet] - 10https://gerrit.wikimedia.org/r/589096 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [02:59:00] (03PS8) 10Andrew Bogott: glance profiles: remove use of $nova_controller param [puppet] - 10https://gerrit.wikimedia.org/r/589112 (https://phabricator.wikimedia.org/T249941) [02:59:02] (03PS4) 10Andrew Bogott: Glance profiles: remove firewall rule for labs_hosts_range [puppet] - 10https://gerrit.wikimedia.org/r/589139 [02:59:04] (03PS5) 10Andrew Bogott: Glance: use keystone_api_fqdn for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/589143 (https://phabricator.wikimedia.org/T249941) [02:59:06] (03PS6) 10Andrew Bogott: Glance profiles: add param types and lookup() calls [puppet] - 10https://gerrit.wikimedia.org/r/589144 [02:59:08] (03PS1) 10Andrew Bogott: glance image_sync: don't make a file depend on itself! [puppet] - 10https://gerrit.wikimedia.org/r/589797 (https://phabricator.wikimedia.org/T249941) [03:03:23] (03CR) 10Andrew Bogott: [C: 03+2] glance image_sync: don't make a file depend on itself! [puppet] - 10https://gerrit.wikimedia.org/r/589797 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [03:42:55] (03PS9) 10Andrew Bogott: glance profiles: remove use of $nova_controller param [puppet] - 10https://gerrit.wikimedia.org/r/589112 (https://phabricator.wikimedia.org/T249941) [03:42:57] (03PS5) 10Andrew Bogott: Glance profiles: remove firewall rule for labs_hosts_range [puppet] - 10https://gerrit.wikimedia.org/r/589139 [03:42:59] (03PS6) 10Andrew Bogott: Glance: use keystone_api_fqdn for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/589143 (https://phabricator.wikimedia.org/T249941) [03:43:01] (03PS7) 10Andrew Bogott: Glance profiles: add param types and lookup() calls [puppet] - 10https://gerrit.wikimedia.org/r/589144 [03:43:03] (03PS1) 10Andrew Bogott: glance image sync: simplify secondary image handling [puppet] - 10https://gerrit.wikimedia.org/r/589814 (https://phabricator.wikimedia.org/T249941) [03:46:18] (03CR) 10jerkins-bot: [V: 04-1] glance image sync: simplify secondary image handling [puppet] - 10https://gerrit.wikimedia.org/r/589814 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [03:48:25] (03PS2) 10Andrew Bogott: glance image sync: simplify secondary image handling [puppet] - 10https://gerrit.wikimedia.org/r/589814 (https://phabricator.wikimedia.org/T249941) [03:48:26] (03PS10) 10Andrew Bogott: glance profiles: remove use of $nova_controller param [puppet] - 10https://gerrit.wikimedia.org/r/589112 (https://phabricator.wikimedia.org/T249941) [03:48:28] (03PS6) 10Andrew Bogott: Glance profiles: remove firewall rule for labs_hosts_range [puppet] - 10https://gerrit.wikimedia.org/r/589139 [03:48:31] (03PS7) 10Andrew Bogott: Glance: use keystone_api_fqdn for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/589143 (https://phabricator.wikimedia.org/T249941) [03:48:33] (03PS8) 10Andrew Bogott: Glance profiles: add param types and lookup() calls [puppet] - 
10https://gerrit.wikimedia.org/r/589144 [03:49:25] (03CR) 10Andrew Bogott: [C: 03+2] glance image sync: simplify secondary image handling [puppet] - 10https://gerrit.wikimedia.org/r/589814 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [03:52:44] RECOVERY - PHP opcache health on mw1407 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:12:15] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10Nuria) Approved @awight already has permits on cluster to see private data, this is just an additional permit to be able to sudo as a particular user [04:14:04] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Christoph Jauera - https://phabricator.wikimedia.org/T250362 (10Nuria) Approved on my end as well, [04:37:42] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [04:42:30] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.129 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:44:18] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 85, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:47:06] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [04:52:56] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [04:54:40] PROBLEM - BFD status on cr1-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:56:34] RECOVERY - BFD status on cr1-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:56:38] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [05:26:44] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [05:32:18] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [05:44:00] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. 
(read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [05:45:58] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JJMC89) [05:47:14] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [05:51:38] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:53:02] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:04:12] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before [06:04:12] ceived: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a res [06:04:12] d: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /api/rest_v1 [06:04:12] /{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [06:06:38] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:27] !log change OSPF metrics to prefer ulsfo tunnel transport [06:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:56] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [06:08:28] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 86, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:10:02] (03PS1) 10Ayounsi: eqsin: temporarily prefer tunnel transport [homer/public] - 10https://gerrit.wikimedia.org/r/589830 [06:14:09] (03CR) 10Ayounsi: [C: 03+2] eqsin: temporarily prefer tunnel transport [homer/public] - 10https://gerrit.wikimedia.org/r/589830 (owner: 10Ayounsi) [06:14:50] (03PS1) 10Ayounsi: Revert "eqsin: temporarily prefer tunnel transport" [homer/public] - 10https://gerrit.wikimedia.org/r/589832 [08:50:37] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 
5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Dvorapa) [09:16:33] Anyone able to get the info from logstash for https://phabricator.wikimedia.org/T250551 [09:20:48] Done by andre ^ [09:54:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:00:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 40 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:26:52] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:32:14] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22733 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:42:52] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:48:16] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 22707 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:39:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:50:52] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22725 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:15:23] !log depool wdqs1006 blazegraph stuck [12:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:48] !log depool wdqs1006 blazegraph stuck T242453 [12:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:55] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453 [12:19:06] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [12:19:13] thats me restarting it [12:19:30] !log restarting blazegraph on wdqs1006 blazegraph stuck T242453 [12:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:14] 8.2 hours lagged >.> ooof [12:20:58] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [12:30:38] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 108 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas 
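(Editor's note: the depool/restart of wdqs1006 logged above corresponds roughly to the following sequence, shown here as a hedged sketch only. The depool/pool conftool wrappers are standard on production hosts; the exact Blazegraph systemd unit name is an assumption.)

  # On wdqs1006, as root -- illustrative only, not the literal commands run:
  depool                              # take the host out of the LVS pools via conftool
  systemctl restart wdqs-blazegraph   # assumed unit name for the Blazegraph service
  # ...let the updater work through the accumulated lag before repooling...
  pool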
[12:35:06] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:42:20] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:43:08] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 76.67 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [12:46:06] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22720 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:48:15] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Dvorapa) Is this some random spike? :D {F31763816} [12:59:48] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10matej_suchanek) T242453 is related. [13:08:21] thanks a lot addshore ! [13:08:25] :) [13:08:43] so it was blazegraph stuck [13:09:15] yup, in a deadlock again it seems, restarted it then it started reporting 8 hours lag, it's depooled and now needs to catch up [13:09:36] unfortunately the lag on the other eqiad servers is also like 2-3 hours and increasing, and wdqs1006 wasn't taking its share of the load [13:10:19] and edit rate on wikidata stayed high, as currently it's the median value for lagged servers that is used, and when eqiad has 3 (cause 1 is deadlocked) and codfw has 4, then that's always a codfw one [13:10:29] otherwise maxlag would have kicked in and kept it down [13:11:00] https://usercontent.irccloud-cdn.com/file/YDYxfDXR/image.png [13:11:55] i know nothing about anything, but if the weights are "10" for codfw and "10" for eqiad, why does codfw get so much less traffic? is there another level of weighting between the datacenters above the per instance level? 
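(Editor's note: a small illustration of the median behaviour addshore describes above; the lag values are invented for the example. With only one badly lagged host among seven pooled wdqs servers, the median stays low, so Wikidata's maxlag never reflects the stuck host and edits are not throttled.)

  # Hypothetical per-server update lag in seconds (six healthy hosts, wdqs1006 hours behind):
  lags="3 5 4 29500 2 6 3"
  echo $lags | tr ' ' '\n' | sort -n | awk '{ a[NR]=$1 } END { print "median lag:", a[int((NR+1)/2)] }'
  # -> median lag: 4   (the 29500 s outlier never moves the median)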
[13:25:09] addshore: the edge in codfw gets a lot less traffic overall, because of how we have geodns set up [13:25:44] aaah gotcha so it has to hit the codfw edge to get to the codfw query services :) [13:26:20] and I imagine that's a hard thing to change at a per service level [13:28:27] geodns --> edge DC --> edge DC does DNS Discovery (which doesn't have weights) to find the closest datacenter with the service available --> LVS+Pybal in that DC looks at per-backend weights [13:28:58] we've been meaning to make less use of eqiad and more use of codfw for NA traffic but no one's sunk too much time into it yet [13:29:18] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [13:34:05] ooof, it's going to take them so long to recover... in the last 1 hour the lag went from 8.24 hours to 8.2 hours D: [13:34:38] and codfw is sitting there with ~30 seconds lag :D [13:38:26] you could disable eqiad in dns discovery temporarily -- codfw had better be able to handle 100% of the load, though [13:42:00] both clusters should be basically the same https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Hardware [13:42:12] is that not codfw's purpose in life? [13:43:06] It definitely shouldn't make the situation worse, it might cause the codfw servers to start lagging a little, but during that time eqiad would be bouncing back [13:43:14] it is, but that doesn't mean it's remained true for all services, Krenair [13:43:26] :/ [13:43:34] (Maps is one example that comes to mind) [13:43:42] let me ssh into the machines and check [13:47:20] heh that page is out of date so i checked the wrong hosts :P [13:48:24] cdanis: it should be able to handle it, especially as eqiad will be on 3 machines for the next 8 hours or so [13:48:48] PROBLEM - MariaDB Slave Lag: pc2 on pc2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:50:49] eqiad is currently running on 3 machines with 32 cores and 126gb ram, codfw currently has 4 machines, 2 with the same as eqiad and 2 with 24 cores and 94gb ram [13:55:25] I guess that would be "query 1D IN CNAME dyna.wikimedia.org." to something like "query 1H IN CNAME text-lb.codfw.wikimedia.org." 
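(Editor's note: rather than editing geodns as suggested above, the depool cdanis performs a few lines below happens at the DNS Discovery layer described in his routing-chain explanation, via conftool on a cumin host. A minimal sketch follows; the dig check against an internal discovery record is an assumed way to verify the change, not something run in this log.)

  # Depool the eqiad backend of the wdqs discovery record, repool it later:
  confctl --object-type discovery select 'dnsdisc=wdqs,name=eqiad' set/pooled=false
  confctl --object-type discovery select 'dnsdisc=wdqs,name=eqiad' set/pooled=true
  # Assumed sanity check: the discovery name should now resolve to the codfw side only
  dig +short wdqs.discovery.wmnet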
[13:55:39] ah, no, don't change the geodns data [13:55:49] https://wikitech.wikimedia.org/wiki/DNS/Discovery [13:56:16] on one of the cumin hosts: confctl --object-type discovery select 'dnsdisc=wdqs,name=eqiad' set/pooled=false [13:56:37] oooh *reads the docs* [13:57:11] cdanis: I doubt I have the technical ability to run any of those commands [13:57:14] ah [13:58:03] looking at https://config-master.wikimedia.org/discovery/services.yaml i'm not sure i understand the differences between standard wdqs, wdqs-heavy-queries, and wdqs-internal [13:59:23] wdqs is what is exposed at query.wikidata.org, the other 2 are both internal only, the heavy one having a longer timeout setup in nginx to pass through the blazegraph [13:59:43] wdqs-internal will point to the internal only servers I gather [14:00:18] for eqiad wdqs1004, wdqs1005, wdqs1006, wdqs1007 being the public ones and 1003 and 1008 being the internal ones [14:00:31] ah [14:01:15] so, it should only be "wdqs" that would need to be touched [14:03:19] !log cdanis@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad [14:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:39] addshore: 5 minute ttl, so you should start seeing traffic shift over that interval [14:03:44] ack! [14:03:49] Keeping my eyes peeled [14:04:34] i'll be near my computer and pingable for another 8h, don't want to leave it depooled in eqiad for longer than that [14:04:45] yup, that sounds fair to me! [14:07:58] seeing load switch over a bit, lovely! [14:12:29] looks like traffic has totally switched over [14:13:34] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 56 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [14:13:55] (03CR) 10Andrew Bogott: [C: 03+2] glance profiles: remove use of $nova_controller param [puppet] - 10https://gerrit.wikimedia.org/r/589112 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:14:14] addshore: and slope has changed on the lag plots for the eqiad wdqsen [14:14:20] yup :) [14:14:33] and so far, no increased lag in codfw [14:15:56] how easy is it to get the ability to switch off one dc for the service added to https://github.com/wikimedia/puppet/blob/ed9bb7be2fed0e4dfa549a82e371649cc5c05205/modules/admin/data/data.yaml#L395-L410 ? 
:P [14:16:55] (03CR) 10Andrew Bogott: [C: 03+2] Glance profiles: remove firewall rule for labs_hosts_range [puppet] - 10https://gerrit.wikimedia.org/r/589139 (owner: 10Andrew Bogott) [14:17:13] addshore: I'll file a task [14:17:54] =] [14:18:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:19:16] (03CR) 10Andrew Bogott: [C: 03+2] Glance: use keystone_api_fqdn for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/589143 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:22:29] (03CR) 10Andrew Bogott: [C: 03+2] Glance profiles: add param types and lookup() calls [puppet] - 10https://gerrit.wikimedia.org/r/589144 (owner: 10Andrew Bogott) [14:22:31] 10Operations: allow non-roots to pool/depool certain DNS Discovery services - https://phabricator.wikimedia.org/T250557 (10CDanis) [14:22:34] ok, I'm off for a bit [14:22:38] o/ [14:22:46] 10Operations: allow non-roots to pool/depool certain DNS Discovery services - https://phabricator.wikimedia.org/T250557 (10CDanis) p:05Triage→03Medium [14:29:54] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22717 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:51] I'm impressed with how well codfw is handling the load actually [14:48:54] (03PS1) 10Andrew Bogott: Openstack nova: remove spiceproxy code and config [puppet] - 10https://gerrit.wikimedia.org/r/589856 [14:48:56] (03PS1) 10Andrew Bogott: nova.conf: remove cc_host config [puppet] - 10https://gerrit.wikimedia.org/r/589857 [14:48:58] (03PS1) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby with openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) [14:49:01] (03CR) 10Jhedden: [C: 03+2] cloudvps: add prometheus alert rules for project instances [puppet] - 10https://gerrit.wikimedia.org/r/589716 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [14:51:22] (03PS2) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby with openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) [14:52:33] (03CR) 10jerkins-bot: [V: 04-1] nova common: replace nova_controller and nova_controller_standby with openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:53:40] (03PS3) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby with openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) [14:56:18] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:57:00] (03CR) 10jerkins-bot: [V: 04-1] nova common: replace nova_controller and nova_controller_standby with openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:58:03] (03PS4) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/589858 
(https://phabricator.wikimedia.org/T249941) [15:01:05] (03CR) 10jerkins-bot: [V: 04-1] nova common: replace nova_controller and nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [15:01:44] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 22731 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:05:10] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 228 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [15:10:12] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:13:48] !log applying schema change of T139090 on labswiki (wikitech) [15:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:58] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [15:19:14] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22721 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:44:05] (03PS1) 10Jhedden: cloudvps: update prometheus rule annotations [puppet] - 10https://gerrit.wikimedia.org/r/589864 (https://phabricator.wikimedia.org/T250206) [15:45:26] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fe4657c8400: Failed to establish a new connection: [Errno 111] Connec [15:45:26] ttps://wikitech.wikimedia.org/wiki/Search%23Administration [15:46:10] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:16] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 390 threshold =0.15 breach: status: yellow, number_of_nodes: 3, delayed_unassigned_shards: 0, active_primary_shards: 797, cluster_name: cloudelastic-chi-eqiad, unassigned_shards: 378, number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, timed_out: False, number_of_in_flig [15:46:16] ve_shards: 1219, initializing_shards: 12, active_shards_percent_as_number: 75.76134244872593, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:46:22] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 390 threshold =0.15 breach: status: yellow, timed_out: False, active_shards_percent_as_number: 75.76134244872593, initializing_shards: 12, delayed_unassigned_shards: 0, active_primary_shards: 797, number_of_nodes: 3, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, unassigned_shards: [15:46:22] s: 1219, number_of_data_nodes: 3, relocating_shards: 0, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration [15:47:38] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 388 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, timed_out: False, active_primary_shards: 797, initializing_shards: 12, relocating_shards: 0, number_of_data_nodes: 3, status: yellow, active_shards_percent_as_number: 75.88564325668116, unassigned_shards: 376, number_of_pending_tasks: 0, [15:47:38] ght_fetch: 0, delayed_unassigned_shards: 0, active_shards: 1221, task_max_waiting_in_queue_millis: 0, number_of_nodes: 3 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:47:58] (03CR) 10Jhedden: [C: 03+2] cloudvps: update prometheus rule annotations [puppet] - 10https://gerrit.wikimedia.org/r/589864 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [15:53:01] checking cloudelastic [15:54:19] there might be some new shards not being initialized, some reindexes were/are running [15:54:33] dcausse or gehel - around by any chance? [15:54:40] elukey: yes [15:54:56] elukey: looking [15:55:40] thanksss [15:57:14] omg 340 unssagined shards :/ [15:59:04] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:06] dcausse: IIRC I saw some discussions between Erik and Maryum about some duplication in last reindex, could it be just that we have duplicates to remove? 
[15:59:11] elukey: elastic on cloudelastic1003 chi was down [15:59:18] sigh [16:00:08] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_pending_tasks: 0, number_of_nodes: 4, cluster_name: cloudelastic-chi-eqiad, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 89.40234134319162, unassigned_shards: 166, number_of_data_nodes: 4, relocating_shards: 0, active_primary_shards: 804, initializing_shards: 6, delayed_ [16:00:09] : 0, status: yellow, task_max_waiting_in_queue_millis: 0, active_shards: 1451, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [16:00:36] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: timed_out: False, number_of_nodes: 4, delayed_unassigned_shards: 0, active_primary_shards: 804, unassigned_shards: 96, initializing_shards: 5, relocating_shards: 0, number_of_data_nodes: 4, number_of_pending_tasks: 4, active_shards_percent_as_number: 93.7769562538509, cluster_name: cloudelastic-chi [16:00:36] _in_flight_fetch: 0, active_shards: 1522, task_max_waiting_in_queue_millis: 617, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [16:00:53] elukey, dcausse I'm around if you need me [16:00:53] dcausse: now I see number_of_nodes: 3 in the alarm, ufff [16:01:02] gehel: David already fixed! [16:01:04] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_primary_shards: 804, relocating_shards: 0, cluster_name: cloudelastic-chi-eqiad, number_of_in_flight_fetch: 0, unassigned_shards: 54, number_of_pending_tasks: 4, active_shards: 1564, number_of_data_nodes: 4, task_max_waiting_in_queue_millis: 4312, delayed_unassigned_shards: 0, number_of_node [16:01:04] ds_percent_as_number: 96.36475662353666, status: yellow, timed_out: False, initializing_shards: 5 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:01:09] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: initializing_shards: 5, number_of_data_nodes: 4, relocating_shards: 0, timed_out: False, active_shards_percent_as_number: 96.36475662353666, number_of_pending_tasks: 4, active_primary_shards: 804, active_shards: 1564, number_of_nodes: 4, unassigned_shards: 54, number_of_in_flight_fetch: 0, task_max [16:01:09] _millis: 10195, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [16:01:47] * gehel is too slow :( [16:02:01] well systemd restarted it automatically [16:02:08] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 1.404 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [16:02:10] not sure why it went down [16:02:18] nothing in the logs :/ [16:02:40] Apr 18 15:42:20 cloudelastic1003 systemd[1]: elasticsearch_6@cloudelastic-chi-eqiad.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED [16:03:30] and then [16:03:31] Apr 18 15:58:32 cloudelastic1003 puppet-agent[6507]: 
(/Stage[main]/Elasticsearch/Elasticsearch::Instance[cloudelastic-chi-eqiad]/Service[elasticsearch_6@cloudelastic-chi-eqiad]/ensure) ensure changed 'stopped' to 'running' [16:03:35] dcausse: --^ [16:03:40] what does this mean? it exited cleanly? [16:04:12] I think it failed with return code 3, and systemd returned that code since it doesn't know how to respond [16:04:16] then puppet restarted [16:04:23] maybe it went OOM or similar [16:05:30] hm, an oom usually has the time to log something [16:06:09] yep yep [16:10:36] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10jmads) >>! In T249873#6058417, @fgiunchedi wrote: > @jmads shell and basic authentication (nda ldap group) access should work now, please confirm! confirmed! th... [16:15:00] dcausse: can't really find what happened :( [16:15:32] elukey: looking at grafana I'm going to assume that it died of an OOM (300+ old gc/hour) [16:16:03] if there's a way to access the systemd PrivateTmp of a dead process then perhaps the jvm err file is still there [16:17:41] dcausse: there is, let's go to #discovery? [16:17:49] sure [16:25:34] cdanis: In ~30 mins I would guess 3 of the 4 eqiad wdqs servers will have caught up, we can probably think about turning eqiad back on then (ish) [16:26:08] also the wikidata edit rate just jumped up as the maxlag has decreased (because less lagged query services) and thus the codfw query service lag will start going up (and it has) [16:31:00] addshore: o/, I'd wait to have the 4 servers pooled, looks like only 3 cannot keep up with the load from the edit rate [16:31:20] Well, it looks like codfw can't keep up with the edit rate either :/ [16:31:44] hm.. and the max lag does not propagate from codfw? 
[16:31:55] not by itself anyway, not sure how long it will take for wdqs1006 to finish catching up, it was 8 hours lagged 4 hours ago [16:32:08] dcausse: yup maxlag will start slowing edits once codfw gets lagged too [16:32:33] (03PS5) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) [16:32:35] but it is the median of all wdqs servers that is used for maxlag, so if 3 out of eqiad have 0 lag, I expect maxlag will be low [16:33:19] * addshore looks at the code that does the maxlag magic [16:33:40] addshore: I see that codfw is still fine [16:34:31] its okay, but the lag on 2001 2002 and 2003 isn't going to stop going up now, its at ~10 mins already and set for a straight line increase until something else takes the load [16:35:18] (03PS6) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) [16:36:11] addshore: indeed I had missed that [16:36:49] If I could change the wdqs -> maxlag code to use median +1 :P this might all be a little nicer [16:38:18] if 3 of 4 servers pooled can't keep up, then I hope that's something that's been noted for capex next year :) [16:38:24] (03PS7) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) [16:39:40] it'd be great to have a way to spread evenly between codfw & eqiad [16:39:45] yes [16:40:38] for sure, and it's something traffic has talked about -- but that helps the median case, not the worst case [16:40:46] (03PS8) 10Andrew Bogott: nova common: replace nova_controller and nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/589858 (https://phabricator.wikimedia.org/T249941) [16:40:57] !log forcing replica count to 1 on some cloudelastic@chi indices [16:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:44] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5026 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:43:34] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 194 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:45:14] elukey: pretty clear case of memcache NIC tx saturation resulting in an appserver tail latency spike right now :) [16:45:25] (and also 75 minutes ago) [16:46:42] dcausse: cdanis I might backport https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikidata.org/+/589874/ to make this median issue a little nicer for the weekend.... [16:47:23] addshore: I'm not competent to review the code, but I think the idea is sensible enough [16:47:23] It's only run by a maint script on mwmaint so a low risk deploy [16:47:55] cdanis: yeah mcrouter metrics show also tkos :( [16:49:27] cdanis: wow did you see https://grafana.wikimedia.org/d/000000316/memcache?panelId=58&fullscreen&orgId=1 ? [16:49:46] https://grafana.wikimedia.org/d/000000316/memcache?panelId=58&fullscreen&orgId=1&from=now-24h&to=now looks horrible [16:49:48] wow yeah... 
that's not just saturating second-to-second [16:49:52] that's just saturated, steady state [16:50:27] looking at codfw i see massive drops? https://grafana.wikimedia.org/d/000000316/memcache?panelId=58&fullscreen&orgId=1&from=now-24h&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=memcached&var-instance=All [16:50:56] i don't know what uses codfw memcache [16:51:04] in theory in codfw we should see only traffic replicated [16:51:11] mw uses a special mcrouter prefix to broadcast [16:51:19] admittedly that's from 11Mbps down.... but the time lines up [16:51:20] but currently not really used or important IIUC [16:51:52] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 109.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [16:53:05] cdanis: do you have a min to chat in #sre? [16:53:08] yes [17:08:05] (03PS1) 10Andrew Bogott: Keystone: remove openstack::keystone::cleanup [puppet] - 10https://gerrit.wikimedia.org/r/589876 (https://phabricator.wikimedia.org/T243418) [17:12:20] (03PS2) 10Andrew Bogott: Keystone: remove openstack::keystone::cleanup [puppet] - 10https://gerrit.wikimedia.org/r/589876 (https://phabricator.wikimedia.org/T243418) [17:12:22] (03PS1) 10Andrew Bogott: openstack::keystone::cleanup: remove all timers [puppet] - 10https://gerrit.wikimedia.org/r/589877 (https://phabricator.wikimedia.org/T243418) [17:13:05] (03PS5) 10ArielGlenn: for 7z production in batches, skip files that exist at beginning of each batch [dumps] - 10https://gerrit.wikimedia.org/r/565301 (https://phabricator.wikimedia.org/T250260) [17:14:28] (03PS3) 10ArielGlenn: add convenience bash script that runs all unit tests [dumps] - 10https://gerrit.wikimedia.org/r/589008 [17:15:47] (03PS3) 10ArielGlenn: fix an annoying typo in the test modules docs [dumps] - 10https://gerrit.wikimedia.org/r/589010 [17:23:20] (03PS1) 10Reedy: Update EventBus classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 [17:23:37] (03PS2) 10Reedy: Update EventBus classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 [17:23:39] (03CR) 10jerkins-bot: [V: 04-1] Update EventBus classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 (owner: 10Reedy) [17:24:14] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:24:46] (03CR) 10jerkins-bot: [V: 04-1] Update EventBus classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 (owner: 10Reedy) [17:25:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22722 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:40:21] (03PS2) 10ArielGlenn: check bz2 page content files for existence before running command batch [dumps] - 10https://gerrit.wikimedia.org/r/589032 (https://phabricator.wikimedia.org/T250260) [17:42:01] (03PS3) 10Reedy: Update EventBus classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 [17:45:14] (03PS1) 10ArielGlenn: Re-enable the second dumps run of the month. 
[puppet] - 10https://gerrit.wikimedia.org/r/589883 [17:45:43] (03CR) 10Reedy: Update EventBus classes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 (owner: 10Reedy) [17:46:06] (03PS4) 10Reedy: Update EventBus classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 [17:46:08] (03CR) 10ArielGlenn: [C: 03+2] Re-enable the second dumps run of the month. [puppet] - 10https://gerrit.wikimedia.org/r/589883 (owner: 10ArielGlenn) [17:49:18] (03CR) 10Reedy: [C: 04-2] "Not till later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589878 (owner: 10Reedy) [18:26:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:37:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:43:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:50:58] PROBLEM - MariaDB Slave Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [18:52:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:52:44] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22723 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:52] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:24:32] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22719 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:00:57] (03PS1) 10Alex Monk: labs: Move RB traffic to new stretch host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589912 (https://phabricator.wikimedia.org/T250574) [20:11:21] (03PS1) 10Alex Monk: deployment-prep: Move RB traffic to new stretch host [puppet] - 10https://gerrit.wikimedia.org/r/589915 (https://phabricator.wikimedia.org/T250574) [20:11:29] (03CR) 10jerkins-bot: [V: 04-1] deployment-prep: Move RB traffic to new stretch host [puppet] - 10https://gerrit.wikimedia.org/r/589915 (https://phabricator.wikimedia.org/T250574) (owner: 10Alex Monk) [20:11:35] (03PS2) 10Alex Monk: deployment-prep: Move RB traffic to new stretch host [puppet] - 10https://gerrit.wikimedia.org/r/589915 (https://phabricator.wikimedia.org/T250574) [20:17:46] PROBLEM - Disk space on Hadoop worker on an-worker1082 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:26:24] PROBLEM - Disk space on Hadoop worker on an-worker1090 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:27:03] !log restart gerrit-replica [20:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:36] !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad [20:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:08] PROBLEM - Disk space on Hadoop worker on an-worker1088 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:54:44] RECOVERY - Disk space on Hadoop worker on an-worker1088 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:01:12] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:02:52] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 15 GB (0% inode=99%): /var/lib/hadoop/data/d 31 GB (0% inode=99%): /var/lib/hadoop/data/l 24 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:10:16] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests 
on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:17:40] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22716 bytes in 7.453 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:21:22] RECOVERY - Disk space on Hadoop worker on an-worker1094 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:22:28] RECOVERY - Disk space on Hadoop worker on an-worker1082 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:25:32] RECOVERY - Disk space on Hadoop worker on an-worker1090 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:28:52] RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:29:35] this was me, freed some space, sadly it was only a band-aid :( [21:29:40] will try to work on it tomorrow [21:31:40] Urbanecm: can you give https://phabricator.wikimedia.org/T250583 a look? A quick google translation says phab might not be the right place for that question but it’s probably your area to reply? [21:32:10] RhinosF1: let me see! [21:32:23] Urbanecm: thx! [21:33:07] closed and explained in Czec [21:33:07] h [21:33:27] Amazing! [21:40:07] Urbanecm: see PM [21:42:54] RECOVERY - MariaDB Slave Lag: pc3 on pc2009 is OK: OK slave_sql_lag Replication lag: 57.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:50:04] !log pool wdqs1006 blazegraph caught up T242453 [22:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:12] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453 [23:21:52] 10Operations, 10serviceops: Research dataset about in-memory caching - https://phabricator.wikimedia.org/T240503 (10Krinkle) [23:23:55] 10Operations, 10serviceops: Publish dataset about Memcached traffic for caching research - https://phabricator.wikimedia.org/T240503 (10Krinkle) [23:26:19] 10Operations, 10serviceops: Publish dataset about Memcached traffic for caching research - https://phabricator.wikimedia.org/T240503 (10Krinkle) Narrowing scope to be about Memcached, which is presumably more widely applicable to the industry. Our Memcached cluster is horizontally scaled (sharded) and accessed... [23:29:37] 10Operations, 10serviceops: Publish dataset about Memcached traffic for caching research - https://phabricator.wikimedia.org/T240503 (101a1a11a) Hi Krinkle, this is what we are looking for, it would be great if we can have such dataset, even if it is sampled. Thank you! Meanwhile, may I ask what kind of data i... [23:30:42] 10Operations, 10serviceops: Publish dataset about Memcached traffic for caching research - https://phabricator.wikimedia.org/T240503 (101a1a11a) BTW, we can help contribute some log collection scripts if needed. [23:32:44] RECOVERY - MariaDB Slave Lag: pc2 on pc2008 is OK: OK slave_sql_lag Replication lag: 14.95 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [23:33:33] 10Operations, 10serviceops: Publish dataset about Memcached traffic for caching research - https://phabricator.wikimedia.org/T240503 (10Krinkle) >>! 
In T240503#6068813, @1a1a11a wrote: > […] what kind of data is stored in memcached cluster? Pretty much anything and everything you can imagine relating to the...