[00:02:13] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (wiki_willy) Looks like it's still not shipped yet. Dell has an order number, but no tracking number yet for shipment.
[00:04:51] (PS1) Legoktm: Use common k8s labels [software/tools-webservice] - https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844)
[00:05:13] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/26244/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:05:54] (PS2) Dzahn: phabricator: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479)
[00:07:45] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (MNovotny_WMF) We can use an expiration date of Feb 1, 2021 - though we may need to extend if the project work continues past that point. thank you!
[00:11:48] (CR) Dzahn: "noop on phab1001" [puppet] - https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:12:22] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[00:12:39] !log removed Nuria from wmf group, she is already in nda group (T266086)
[00:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:46] T266086: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086
[00:16:45] (CR) Dzahn: [C: +1] "ready now I think" [puppet] - https://gerrit.wikimedia.org/r/636936 (owner: Muehlenhoff)
[00:17:25] (CR) Dzahn: [C: +1] "LDAP group part already done" [puppet] - https://gerrit.wikimedia.org/r/636936 (owner: Muehlenhoff)
[00:19:01] (CR) Dzahn: [C: +2] openstack: turn bash scripts without bashisms into sh scripts [puppet] - https://gerrit.wikimedia.org/r/631891 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[00:21:40] (CR) Dzahn: "I guess a new type has to be first added.. and then used in a second patch." [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[00:27:25] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on clouddb1019 - https://phabricator.wikimedia.org/T266912 (Peachey88)
[00:28:58] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 2019 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:30:36] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:31:17] (CR) Peachey88: "I think the task number should be T266911" [puppet] - https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266912) (owner: Ryan Kemper)
[00:33:40] (PS4) Dzahn: puppetmaster: add data type for server type [puppet] - https://gerrit.wikimedia.org/r/635660
[00:36:26] (PS5) Dzahn: puppetmaster: add data type for server type and use it [puppet] - https://gerrit.wikimedia.org/r/635660
[00:37:09] (PS2) Dzahn: cirrus: temporarily disable saneitizer [puppet] - https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266911) (owner: Ryan Kemper)
[00:37:11] (CR) Dzahn: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266911) (owner: Ryan Kemper)
[00:39:15] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) Thank you! I'll add it. And renewals are no problem at all. That's standard practice. We will reach out to you again once it expires.
[00:40:03] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) Stalled→Open a:MNovotny_WMF→Dzahn
[00:40:09] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/26245/" [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[00:42:27] (PS1) Dzahn: admin: add expiry date and contact for mraish [puppet] - https://gerrit.wikimedia.org/r/637817 (https://phabricator.wikimedia.org/T262316)
[00:44:46] (CR) Dzahn: [C: +2] admin: add expiry date and contact for mraish [puppet] - https://gerrit.wikimedia.org/r/637817 (https://phabricator.wikimedia.org/T262316) (owner: Dzahn)
[00:45:17] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) Open→Resolved
[00:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:43] (CR) Dzahn: [C: +2] "thanks for the detailed review. Yea, I think you can convince me that we don't any of this anymore in the future and only rely on netbox, " [puppet] - https://gerrit.wikimedia.org/r/637577 (owner: Dzahn)
[00:49:01] (PS2) Dzahn: site: move mw1267,mw1268 from rack A7 to rack A8 [puppet] - https://gerrit.wikimedia.org/r/637576 (https://phabricator.wikimedia.org/T266164)
[00:55:35] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/26246/" [puppet] - https://gerrit.wikimedia.org/r/637576 (https://phabricator.wikimedia.org/T266164) (owner: Dzahn)
[00:56:10] (PS4) Dzahn: site/appservers: cleanup comments about appserver rack locations [puppet] - https://gerrit.wikimedia.org/r/637577
[00:57:13] (CR) Dzahn: [C: +2] "comments only" [puppet] - https://gerrit.wikimedia.org/r/637577 (owner: Dzahn)
[00:58:07] (CR) Dzahn: "noop" [puppet] - https://gerrit.wikimedia.org/r/637576 (https://phabricator.wikimedia.org/T266164) (owner: Dzahn)
[02:22:25] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[02:47:17] (PS1) Urbanecm: Add Response namespace at otrs_wikiwiki to namespaces searched by default [mediawiki-config] - https://gerrit.wikimedia.org/r/637819 (https://phabricator.wikimedia.org/T266917)
[02:51:26] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[02:53:00] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[02:56:56] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:01:58] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:07:42] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:09:22] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:15:06] PROBLEM - ores on ores1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[03:18:24] RECOVERY - ores on ores1007 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 4.802 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[03:30:51] Restarting ores uwsgi
[03:57:06] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:08] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:49] Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (greg) >>! In T266865#6592398, @Legoktm wrote: >>>! In T266865#6592372, @CDanis wrote: >> Long ago, frwiki's default feed length (in da...
[04:27:22] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:24] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:34] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201031T0700)
[07:22:28] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on clouddb1019 - https://phabricator.wikimedia.org/T266912 (Marostegui) Open→Resolved a:Marostegui This host was being installed, and hence the raid was building. It is ok now: ` Number of Virtual Disks: 1 Virtual Drive: 0...
[07:22:36] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui)
[07:27:40] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:32:42] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:34:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:37:34] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:41:43] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui) Resolved→Open es1032 has RAID0 instead of RAID10. Can we get that one re-done with RAID10 and strip size 256? Thanks! ` root@es10...
[07:42:36] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:16:44] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:17:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:53:44] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:37:42] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 80 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:43:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 36 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:16:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:17:38] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:21:47] (PS11) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[16:26:56] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:58] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:24] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:02:28] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:21:56] Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (Dzahn) >>! In T266865#6592398, @Legoktm wrote: >>> I would say the practical impact of this change will be pretty low. Yes, the impac...
[17:27:00] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[17:28:36] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[17:33:38] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[17:45:24] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 23601 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:27:02] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:08] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:27:26] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:32:28] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:57:36] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:40] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:12] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:47] (PS12) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[22:18:08] (CR) jerkins-bot: [V: -1] per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[22:31:40] (PS13) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)