[00:03:32] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:05:00] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:10:04] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[00:13:58] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[00:20:18] PROBLEM - people.wikimedia.org requires authentication on people2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[00:27:24] ACKNOWLEDGEMENT - people.wikimedia.org requires authentication on people2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found daniel_zahn reinstalled https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[00:31:20] (CR) Dzahn: "on webperf1002/2002 the puppet runs are not repeating things anymore - looks good - https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cg" [puppet] - https://gerrit.wikimedia.org/r/626779 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[00:32:24] (CR) Dzahn: "yep, gone from icinga now. :)" [puppet] - https://gerrit.wikimedia.org/r/626779 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[00:37:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:38:26] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:38:38] !log all issues with hosts doing stuff "on every run" have been fixed except one is left: analytics1034
[00:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:22] (PS1) Dzahn: peopleweb: set people2001 as rsync destination [puppet] - https://gerrit.wikimedia.org/r/626791
[01:03:41] (CR) Dzahn: [C: +2] peopleweb: set people2001 as rsync destination [puppet] - https://gerrit.wikimedia.org/r/626791 (owner: Dzahn)
[01:07:22] !log people2001 - rsyncing user home dirs from people1002
[01:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:30] RECOVERY - people.wikimedia.org requires authentication on people2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 2.176 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[01:20:46] Operations, Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (Dzahn) @Adamant.pwn I asked around for a public page that describes how to contact them, for the very reason you describe. I'm not sure there is one. But you can go to https://wikimediafoundat...
[01:26:03] (PS1) Dzahn: admins: add Michael Raish to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/626792 (https://phabricator.wikimedia.org/T262316)
[01:28:57] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) Hello @MNovotny_WMF when adding contractors to LDAP there are 2 more fields to fill out. That is the expiration_date (of the contract) and who to cont...
[01:29:01] (CR) Dzahn: [C: +2] admins: add Michael Raish to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/626792 (https://phabricator.wikimedia.org/T262316) (owner: Dzahn)
[01:30:19] (CR) Dzahn: "NDA on file - confirmed on ticket - adding to nda LDAP group (contractor), expiry_date will be added" [puppet] - https://gerrit.wikimedia.org/r/626792 (https://phabricator.wikimedia.org/T262316) (owner: Dzahn)
[01:32:32] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) @Mraish You have been added to the "nda" LDAP group. You should be able to login now.
[01:32:47] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn)
[01:33:03] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) a:Dzahn
[02:22:12] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:23:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:37:10] (CR) Krinkle: "The 'dcs' addition and the wmg* var for to used in filebackend LGTM. But since it affects production and is potentially risky and hard to " (7 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/622193 (owner: Ahmon Dancy)
[03:23:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:24:58] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:11:20] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people2001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:00:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:04:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:57:08] mutante: o/ analytics1034 had the disk full, I think that puppet should work fine now!
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200912T0700)
[07:06:44] Operations, Platform Engineering, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (RhinosF1) > I'm also seeing intermittent cirrussearch-too-busy-error when searching (on all wikis). They were a couple report...
[07:14:12] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:09:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:11:34] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:48:44] PROBLEM - kubelet operational latencies on kubernetes2008 is CRITICAL: instance=kubernetes2008.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:52:38] RECOVERY - kubelet operational latencies on kubernetes2008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:05:18] Operations, Wikimedia-Mailing-lists: Wikimedia-RU mailing list page has wrong encoding - https://phabricator.wikimedia.org/T135226 (Aklapper) See T261031
[11:37:04] PROBLEM - PHP7 rendering on mw2297 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:37:30] PROBLEM - Apache HTTP on mw2297 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:38:10] RECOVERY - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[13:42:37] (CR) Dbarratt: [C: +1] Enable Special:Investigate on itwiki, eswiki and svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/626715 (https://phabricator.wikimedia.org/T262436) (owner: Tchanders)
[14:01:54] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:03:52] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[15:49:06] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[15:53:02] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[17:06:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 128 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:08:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 34 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:36:08] (PS1) Majavah: Change votewiki language temporarily to fa for fawiki elections [mediawiki-config] - https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689)
[19:47:22] Operations, Wikimedia-Mailing-lists: Several unreadable mailing list descriptions due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Carn)
[19:47:34] (CR) Urbanecm: [C: -2] "not yet ready to be deployed, this is a part of "Oct 19 - Oct 22: Election setup"" [mediawiki-config] - https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: Majavah)
[22:22:41] (CR) Ryan Kemper: [C: +2] Use dedicated schedules for the various wikidata ttl dumps [puppet] - https://gerrit.wikimedia.org/r/622342 (https://phabricator.wikimedia.org/T261204) (owner: DCausse)