[00:25:40] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[00:31:46] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 22331 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[00:45:10] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:12] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:46:26] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[00:47:16] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:32] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[00:57:16] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:59:17] (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/582531 (https://phabricator.wikimedia.org/T241114) (owner: JJMC89)
[01:00:36] (PS2) VolkerE: Remove unnecessary, overqualified element parts of id selectors [mediawiki-config] - https://gerrit.wikimedia.org/r/581879 (https://phabricator.wikimedia.org/T248137)
[01:20:54] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[01:21:06] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[01:37:50] (CR) Krinkle: [C: +1] Remove unnecessary, overqualified element parts of id selectors [mediawiki-config] - https://gerrit.wikimedia.org/r/581879 (https://phabricator.wikimedia.org/T248137) (owner: VolkerE)
[01:38:12] (CR) Krinkle: [C: +1] "Confirmed by copying the RHS output into DevTools while viewing https://noc.wikimedia.org/." [mediawiki-config] - https://gerrit.wikimedia.org/r/581879 (https://phabricator.wikimedia.org/T248137) (owner: VolkerE)
[01:38:37] (PS3) Krinkle: Remove unnecessary, overqualified element parts of id selectors [mediawiki-config] - https://gerrit.wikimedia.org/r/581879 (https://phabricator.wikimedia.org/T248137) (owner: VolkerE)
[04:37:37] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[04:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:58] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[06:23:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[06:33:52] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 22312 bytes in 3.053 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[06:42:16] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:42:24] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[07:05:56] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-04-21 07:03:51 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[08:12:21] (PS1) KartikMistry: apertium-hbs-mkd: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-hbs-mkd] - https://gerrit.wikimedia.org/r/582556 (https://phabricator.wikimedia.org/T247585)
[08:50:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:06:40] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:19:02] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 22320 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:33:05] (PS1) Elukey: systemd::timer::job: add a parameter to change the systemd slice [puppet] - https://gerrit.wikimedia.org/r/582558
[09:37:01] (PS1) Elukey: statistics::discovery: run the timer under the user.slice [puppet] - https://gerrit.wikimedia.org/r/582559
[09:41:12] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/21524/" [puppet] - https://gerrit.wikimedia.org/r/582559 (owner: Elukey)
[12:47:06] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[12:55:26] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[13:00:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:02:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:45:18] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[15:34:37] (PS1) KartikMistry: apertium-ind-zlm: Fix FTBFS with apertium 3.6 + 0.1.2 release [debs/contenttranslation/apertium-id-ms] - https://gerrit.wikimedia.org/r/582581 (https://phabricator.wikimedia.org/T247585)
[15:34:46] (CR) jerkins-bot: [V: -1] apertium-ind-zlm: Fix FTBFS with apertium 3.6 + 0.1.2 release [debs/contenttranslation/apertium-id-ms] - https://gerrit.wikimedia.org/r/582581 (https://phabricator.wikimedia.org/T247585) (owner: KartikMistry)
[15:37:42] (PS2) KartikMistry: apertium-ind-zlm: Fix FTBFS with apertium 3.6 + 0.1.2 release [debs/contenttranslation/apertium-id-ms] - https://gerrit.wikimedia.org/r/582581 (https://phabricator.wikimedia.org/T247585)
[16:47:23] (PS1) Andrew Bogott: Horizon: convert disabled_policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/582584 (https://phabricator.wikimedia.org/T247795)
[16:47:25] (PS1) Andrew Bogott: horizon: remove many unused config files [puppet] - https://gerrit.wikimedia.org/r/582585
[17:05:24] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[17:13:42] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 22330 bytes in 7.058 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:46:24] (CR) Andrew Bogott: [C: +2] Horizon: convert disabled_policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/582584 (https://phabricator.wikimedia.org/T247795) (owner: Andrew Bogott)
[18:52:34] (CR) Andrew Bogott: [C: +2] horizon: remove many unused config files [puppet] - https://gerrit.wikimedia.org/r/582585 (owner: Andrew Bogott)
[18:53:24] PROBLEM - WDQS high update lag on wdqs1010 is CRITICAL: 1.696e+06 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:19:36] (PS1) Andrew Bogott: Horizon: split $version into $horizon_version and $openstack_version [puppet] - https://gerrit.wikimedia.org/r/582594 (https://phabricator.wikimedia.org/T247575)
[19:23:06] (CR) jerkins-bot: [V: -1] Horizon: split $version into $horizon_version and $openstack_version [puppet] - https://gerrit.wikimedia.org/r/582594 (https://phabricator.wikimedia.org/T247575) (owner: Andrew Bogott)
[19:25:31] (PS2) Andrew Bogott: Horizon: split $version into $horizon_version and $openstack_version [puppet] - https://gerrit.wikimedia.org/r/582594 (https://phabricator.wikimedia.org/T247575)
[19:29:06] (CR) jerkins-bot: [V: -1] Horizon: split $version into $horizon_version and $openstack_version [puppet] - https://gerrit.wikimedia.org/r/582594 (https://phabricator.wikimedia.org/T247575) (owner: Andrew Bogott)
[20:02:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:02:24] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:21:52] Operations, observability, service-runner, serviceops-radar, and 2 others: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (Aklapper)
[22:03:28] chasemp: can you give https://phabricator.wikimedia.org/T248273#5990536 a look?
[23:12:54] (PS1) Brian Wolff: Make wgWMEClientErrorIntakeURL use https on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/582599 (https://phabricator.wikimedia.org/T248274)
[23:15:12] (PS1) Reedy: Use https for wgWMEClientErrorIntakeURL [mediawiki-config] - https://gerrit.wikimedia.org/r/582600 (https://phabricator.wikimedia.org/T248274)
[23:15:34] lol
[23:15:40] (Abandoned) Reedy: Use https for wgWMEClientErrorIntakeURL [mediawiki-config] - https://gerrit.wikimedia.org/r/582600 (https://phabricator.wikimedia.org/T248274) (owner: Reedy)
[23:15:50] (CR) Reedy: [C: +2] Make wgWMEClientErrorIntakeURL use https on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/582599 (https://phabricator.wikimedia.org/T248274) (owner: Brian Wolff)
[23:16:43] (Merged) jenkins-bot: Make wgWMEClientErrorIntakeURL use https on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/582599 (https://phabricator.wikimedia.org/T248274) (owner: Brian Wolff)
[23:16:53] It's a race!
[23:19:16] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: T248274 (duration: 01m 19s)
[23:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:23] T248274: Sometimes requests on beta cluster are made to http://eventgate-logging.wmflabs.org without TLS - https://phabricator.wikimedia.org/T248274
[23:19:29] Thanks :)
[23:53:23] today I'm getting a lot of emails from XTools about production API calls timing out, and sometimes an "empty reply from server". Is this a known issue?
[23:54:18] what I mean by that is XTools has a try/catch when it makes API calls to the wiki, and sends an email to maintainers when they fail
[23:56:19] I'm still not very good at finding things on logstash, but the last timeout was 2 minutes ago for https://en.wikipedia.org/wiki/2nd_Division
[23:56:50] -cloud is probably the better channel
[23:56:51] I can find out which API was used, but it seems all are affected. Some kind of networking issue, I'm guessing
[23:57:19] no this is production, I was just clarifying how I found out about these timeouts
[23:57:32] forget I said XTools
[23:58:24] these timeouts are pretty regular, I get a handful of emails a day. But today there many hundreds
[23:58:34] *there were
[23:58:57] I don't really know, but I don't see anything abnormal looking in the various pretty graphs