[01:34:50] (Abandoned) Ladsgroup: meet: Add /etc/meet-auth to store the configs and data [puppet] - https://gerrit.wikimedia.org/r/606824 (owner: Ladsgroup)
[05:27:01] (CR) Legoktm: mailman3: Add parts for Postorius (web interface) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[05:27:52] (CR) Legoktm: [C: +1] "LGTM, I'll merge this in a bit" [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[07:36:45] SRE: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (Aklapper)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210117T0800)
[08:29:44] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.267 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[09:17:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:20:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:37:02] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:39:14] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:55:45] SRE, Wikimedia-Logstash, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (Majavah)
[12:05:58] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263267 MB (15% inode=81%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[13:44:36] SRE, Cloud-VPS, netops, cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (Aklapper) >>! In T180179#6616621, @aborrero wrote: > Isn't that the meaning of the `stalled` status? :-) No, see https://www.mediawiki.or...
[17:08:43] SRE, LDAP-Access-Requests, SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (Dzahn) This was a legit Gerrit-Privilege-Request and it was resolved. Why are we removing project tags after the fact now?
[17:17:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:19:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:43:49] (CR) Elukey: [C: +2] Disable directory listing in Jetty [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/656448 (https://phabricator.wikimedia.org/T272082) (owner: Elukey)
[19:40:30] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 27.8 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[19:47:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:49:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:15:14] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:08] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2055 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:40:12] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:34] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2055 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:59:15] (PS1) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[21:59:56] (CR) jerkins-bot: [V: -1] Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[22:02:04] (PS2) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:02:46] (CR) jerkins-bot: [V: -1] Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[22:18:51] (PS3) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:20:24] (CR) jerkins-bot: [V: -1] Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[22:22:27] (PS4) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:29:04] (PS5) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:32:28] (PS6) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:34:26] (PS7) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:38:21] (PS8) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:39:49] (PS9) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:39:56] (CR) jerkins-bot: [V: -1] Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[22:43:11] (PS10) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:44:47] (CR) jerkins-bot: [V: -1] Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[22:46:18] (PS11) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:47:50] (CR) jerkins-bot: [V: -1] Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[22:48:58] (PS12) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[22:59:11] (PS13) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[23:01:33] (PS14) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[23:03:15] (PS15) Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[23:07:42] (CR) Andrew Bogott: "I've tested the puppet integration &c. on a VM; mostly interested in thoughts about the .py file and use of webob/jinja." [puppet] - https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[23:39:52] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 56.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:41:00] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 58.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:42:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 100.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:43:26] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 79.77 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1