[00:02:13] <wikibugs>	 Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (wiki_willy) Looks like it's still not shipped yet.  Dell has an order number, but no tracking number yet for shipment.
[00:04:51] <wikibugs>	 (PS1) Legoktm: Use common k8s labels [software/tools-webservice] - https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844)
[00:05:13] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/26244/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:05:54] <wikibugs>	 (PS2) Dzahn: phabricator: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479)
[00:07:45] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (MNovotny_WMF) We can use an expiration date of Feb 1, 2021 - though we may need to extend if the project work continues past that point. thank you!
[00:11:48] <wikibugs>	 (CR) Dzahn: "noop on phab1001" [puppet] - https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:12:22] <icinga-wm>	 PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[00:12:39] <mutante>	 !log removed Nuria from wmf group, she is already in nda group (T266086)
[00:12:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:46] <stashbot>	 T266086: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086
[00:16:45] <wikibugs>	 (CR) Dzahn: [C: +1] "ready now I think" [puppet] - https://gerrit.wikimedia.org/r/636936 (owner: Muehlenhoff)
[00:17:25] <wikibugs>	 (CR) Dzahn: [C: +1] "LDAP group part already done" [puppet] - https://gerrit.wikimedia.org/r/636936 (owner: Muehlenhoff)
[00:19:01] <wikibugs>	 (CR) Dzahn: [C: +2] openstack: turn bash scripts without bashisms into sh scripts [puppet] - https://gerrit.wikimedia.org/r/631891 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[00:21:40] <wikibugs>	 (CR) Dzahn: "I guess a new type has to be first added.. and then used in a second patch." [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[00:27:25] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on clouddb1019 - https://phabricator.wikimedia.org/T266912 (Peachey88)
[00:28:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 2019 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:30:36] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:31:17] <wikibugs>	 (CR) Peachey88: "I think the task number should be T266911" [puppet] - https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266912) (owner: Ryan Kemper)
[00:33:40] <wikibugs>	 (PS4) Dzahn: puppetmaster: add data type for server type [puppet] - https://gerrit.wikimedia.org/r/635660
[00:36:26] <wikibugs>	 (PS5) Dzahn: puppetmaster: add data type for server type and use it [puppet] - https://gerrit.wikimedia.org/r/635660
[00:37:09] <wikibugs>	 (PS2) Dzahn: cirrus: temporarily disable saneitizer [puppet] - https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266911) (owner: Ryan Kemper)
[00:37:11] <wikibugs>	 (CR) Dzahn: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266911) (owner: Ryan Kemper)
[00:39:15] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) Thank you! I'll add it. And renewals are no problem at all. That's standard practice. We will reach out to you again once it expires.
[00:40:03] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) Stalled→Open a:MNovotny_WMF→Dzahn
[00:40:09] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/26245/" [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[00:42:27] <wikibugs>	 (PS1) Dzahn: admin: add expiry date and contact for mraish [puppet] - https://gerrit.wikimedia.org/r/637817 (https://phabricator.wikimedia.org/T262316)
[00:44:46] <wikibugs>	 (CR) Dzahn: [C: +2] admin: add expiry date and contact for mraish [puppet] - https://gerrit.wikimedia.org/r/637817 (https://phabricator.wikimedia.org/T262316) (owner: Dzahn)
[00:45:17] <wikibugs>	 Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Dzahn) Open→Resolved
[00:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:43] <wikibugs>	 (CR) Dzahn: [C: +2] "thanks for the detailed review. Yea, I think you can convince me that we don't need any of this anymore in the future and only rely on netbox, " [puppet] - https://gerrit.wikimedia.org/r/637577 (owner: Dzahn)
[00:49:01] <wikibugs>	 (PS2) Dzahn: site: move mw1267,mw1268 from rack A7 to rack A8 [puppet] - https://gerrit.wikimedia.org/r/637576 (https://phabricator.wikimedia.org/T266164)
[00:55:35] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/26246/" [puppet] - https://gerrit.wikimedia.org/r/637576 (https://phabricator.wikimedia.org/T266164) (owner: Dzahn)
[00:56:10] <wikibugs>	 (PS4) Dzahn: site/appservers: cleanup comments about appserver rack locations [puppet] - https://gerrit.wikimedia.org/r/637577
[00:57:13] <wikibugs>	 (CR) Dzahn: [C: +2] "comments only" [puppet] - https://gerrit.wikimedia.org/r/637577 (owner: Dzahn)
[00:58:07] <wikibugs>	 (CR) Dzahn: "noop" [puppet] - https://gerrit.wikimedia.org/r/637576 (https://phabricator.wikimedia.org/T266164) (owner: Dzahn)
[02:22:25] <icinga-wm>	 RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[02:47:17] <wikibugs>	 (PS1) Urbanecm: Add Response namespace at otrs_wikiwiki to namespaces searched by default [mediawiki-config] - https://gerrit.wikimedia.org/r/637819 (https://phabricator.wikimedia.org/T266917)
[02:51:26] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[02:53:00] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[02:56:56] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:01:58] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:07:42] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:09:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:15:06] <icinga-wm>	 PROBLEM - ores on ores1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[03:18:24] <icinga-wm>	 RECOVERY - ores on ores1007 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 4.802 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[03:30:51] <chrisalbon>	 Restarting ores uwsgi
[03:57:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:08] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:49] <wikibugs>	 Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (greg) >>! In T266865#6592398, @Legoktm wrote: >>>! In T266865#6592372, @CDanis wrote: >> Long ago, frwiki's default feed length (in da...
[04:27:22] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:24] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:34] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201031T0700)
[07:22:28] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on clouddb1019 - https://phabricator.wikimedia.org/T266912 (Marostegui) Open→Resolved a:Marostegui This host was being installed, and hence the raid was building. It is ok now: ` Number of Virtual Disks: 1 Virtual Drive: 0...
[07:22:36] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui)
[07:27:40] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:32:42] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:34:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:37:34] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:41:43] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui) Resolved→Open es1032 has RAID0 instead of RAID10.  Can we get that one re-done with RAID10 and strip size 256?  Thanks! ` root@es10...
[07:42:36] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:16:44] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:17:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:40] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:53:44] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:37:42] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 80 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:43:18] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 36 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:16:02] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:17:38] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:21:47] <wikibugs>	 (PS11) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[16:26:56] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:58] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:24] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:02:28] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:21:56] <wikibugs>	 Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (Dzahn) >>! In T266865#6592398, @Legoktm wrote: >>> I would say the practical impact of this change will be pretty low.  Yes, the impac...
[17:27:00] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[17:28:36] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[17:33:38] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[17:45:24] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 23601 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:27:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:08] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:27:26] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:32:28] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:57:36] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:12] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:47] <wikibugs>	 (PS12) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[22:18:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[22:31:40] <wikibugs>	 (PS13) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)