[01:05:51] (CR) Jhedden: [C: +1] "Great idea, looks good" [puppet] - https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott)
[01:19:08] (CR) Jhedden: [C: +1] "Looks really good!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott)
[02:32:18] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:50:40] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:15:34] PROBLEM - cassandra-a SSL 10.192.16.85:7001 on restbase2014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[03:15:40] PROBLEM - cassandra-c SSL 10.192.16.87:7001 on restbase2014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[03:15:56] PROBLEM - cassandra-b SSL 10.192.16.86:7001 on restbase2014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[03:16:04] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:04] PROBLEM - cassandra-b CQL 10.192.16.86:9042 on restbase2014 is CRITICAL: connect to address 10.192.16.86 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[03:16:10] PROBLEM - cassandra-c CQL 10.192.16.87:9042 on restbase2014 is CRITICAL: connect to address 10.192.16.87 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[03:16:14] PROBLEM - cassandra-a service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:16:38] PROBLEM - cassandra-a CQL 10.192.16.85:9042 on restbase2014 is CRITICAL: connect to address 10.192.16.85 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[03:16:48] PROBLEM - cassandra-c service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:16:52] PROBLEM - cassandra-b service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:25:26] PROBLEM - MD RAID on restbase2014 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:25:27] ACKNOWLEDGEMENT - MD RAID on restbase2014 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T250050 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:25:30] Operations, ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (ops-monitoring-bot)
[03:49:02] RECOVERY - Check systemd state on
restbase2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:34] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:06] RECOVERY - cassandra-c service on restbase2014 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:24:36] PROBLEM - cassandra-c service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:33:24] RECOVERY - snapshot of s5 in eqiad on db1115 is OK: snapshot for s5 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-04-13 03:18:16 from db1102.eqiad.wmnet:3315 (667 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[04:35:20] PROBLEM - Device not healthy -SMART- on restbase2014 is CRITICAL: cluster=restbase device=sdd instance=restbase2014:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase2014&var-datasource=codfw+prometheus/ops
[05:18:33] (PS1) Vgutierrez: Release 8.0.7-rc0-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588201
[05:19:18] RECOVERY - cassandra-c service on restbase2014 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:19:20] RECOVERY - cassandra-b service on restbase2014 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:24:46] PROBLEM - cassandra-c service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:24:48] PROBLEM - cassandra-b service on restbase2014 is CRITICAL: CRITICAL - Expecting
active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:25:22] !log restart varnish-fe on cp3050
[05:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:57] (PS1) Marostegui: clouddb.sql.erb: Add GRANTs file [puppet] - https://gerrit.wikimedia.org/r/588202
[05:59:34] (CR) Vgutierrez: [C: +2] Release 8.0.7-rc0-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588201 (owner: Vgutierrez)
[06:02:39] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (Marostegui) @Urbanecm I saw you created the database, next time please ping us on this sort of ticket "Prepare and check storage layer" s...
[06:03:02] !log Sanitize grwikimedia on db2094:3313 and db1124:3313 - T245912
[06:03:03] T245912: Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912
[06:19:40] RECOVERY - cassandra-c service on restbase2014 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:19:42] RECOVERY - cassandra-b service on restbase2014 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:20:45] !log upload trafficserver 8.0.7-rc0-1wm1 to apt.wm.o (buster)
[06:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:41] !log upgrade to ats 8.0.7-rc0-1wm1 on cp[4026,4032,5006,5012]
[06:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:08] PROBLEM - cassandra-c service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:25:10] PROBLEM - cassandra-b service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:26:25] (PS1) Marostegui: pc[12]008: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/588203 (https://phabricator.wikimedia.org/T247787)
[06:27:32] ah lovely - for restbase2014
[06:27:34] java.nio.file.FileSystemException: /srv/sdc4/cassandra-a/data: Input/output error
[06:27:44] (CR) Marostegui: [C: +2] pc[12]008: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/588203 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui)
[06:29:39] elukey: happy monday!
[06:31:24] marostegui: hola :D indeed
[06:31:49] so /dev/sdc failed, and it seems preventing cassandra instances on 2014 to run
[06:34:33] ah nice https://phabricator.wikimedia.org/T250050
[06:36:19] !log temporary stopped puppet on restbase2014 to avoid attempts to start cassandra on each run - T250050
[06:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:25] T250050: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050
[06:37:43] acked the alerts to avoid spam in here
[06:37:51] err in icinga
[06:38:35] Operations, ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (elukey) @Eevans this is the weekend of broken cassandra hosts, adding you as FYI :)
[06:50:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 T249973', diff saved to https://phabricator.wikimedia.org/P10961 and previous config saved to /var/cache/conftool/dbconfig/20200413-065022-marostegui.json
[06:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:29] T249973: db1110 has 5 important database drifts that are unique to the host - https://phabricator.wikimedia.org/T249973
[06:51:31] !log Deploy schema changes on db1110 - T249973
[06:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:41] !log marostegui@cumin1001 dbctl commit (dc=all):
'Repool db1110 T249973', diff saved to https://phabricator.wikimedia.org/P10962 and previous config saved to /var/cache/conftool/dbconfig/20200413-071740-marostegui.json
[07:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:47] T249973: db1110 has 5 important database drifts that are unique to the host - https://phabricator.wikimedia.org/T249973
[07:23:46] (CR) Dzahn: [C: +2] mediawiki: Document the apache sample hosts [puppet] - https://gerrit.wikimedia.org/r/587289 (https://phabricator.wikimedia.org/T244472) (owner: Krinkle)
[07:23:59] (PS1) Vgutierrez: ATS: Enable res_track_memory in cp1085 [puppet] - https://gerrit.wikimedia.org/r/588366 (https://phabricator.wikimedia.org/T249335)
[07:30:15] (CR) Dzahn: icinga: Add git local changes check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/588049 (owner: CRusnov)
[07:30:30] (PS2) Vgutierrez: ATS: Enable res_track_memory in cp1085 [puppet] - https://gerrit.wikimedia.org/r/588366 (https://phabricator.wikimedia.org/T249335)
[07:34:17] (CR) Vgutierrez: "https://puppet-compiler.wmflabs.org/compiler1002/21875/" [puppet] - https://gerrit.wikimedia.org/r/588366 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[07:37:37] (CR) Dzahn: [C: +1] "tried it out locally with the puppet repo. works for me. returns 0 without changes and CRIT when adding an untracked file."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/588049 (owner: CRusnov)
[07:39:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092 T232446', diff saved to https://phabricator.wikimedia.org/P10963 and previous config saved to /var/cache/conftool/dbconfig/20200413-073939-marostegui.json
[07:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:46] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[07:40:11] (CR) Dzahn: [C: +2] "easy to revert if we end up wanting to reinstall them. cleaning up." [puppet] - https://gerrit.wikimedia.org/r/585185 (https://phabricator.wikimedia.org/T247780) (owner: Dzahn)
[07:40:22] (PS2) Dzahn: DHCP: remove mw1254-mw1258 [puppet] - https://gerrit.wikimedia.org/r/585185 (https://phabricator.wikimedia.org/T247780)
[07:40:40] !log rolling upgrade to ats 8.0.7-rc0-1wm1 in ulsfo
[07:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Temporary pool db1111 in s8 API', diff saved to https://phabricator.wikimedia.org/P10964 and previous config saved to /var/cache/conftool/dbconfig/20200413-074158-marostegui.json
[07:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:23] !log Compress db1092 T232446
[07:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:11] (CR) Vgutierrez: [C: +2] ATS: Enable res_track_memory in cp1085 [puppet] - https://gerrit.wikimedia.org/r/588366 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[07:50:06] !log enable memory tracking in ats-tls on cp1085 - T249335
[07:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:13] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335
[07:51:54] Operations: Netbox report accounting icinga alert -
https://phabricator.wikimedia.org/T250053 (ayounsi) p:Triage→Low
[07:55:58] Operations, ops-eqiad, ops-ulsfo: Netbox report coherence_rack Icinga alert - https://phabricator.wikimedia.org/T250054 (ayounsi) p:Triage→Low
[07:59:41] Operations, Traffic, Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (Vgutierrez) res_memory_tracking should help a lot, this is an example report @ cp1085 after restarting ats-tls: ` Allocated | In-Use | Type Size | Free List Name --...
[08:15:06] !log Remove grants for haproxy@10.64.37.15 from labsdb hosts T231280
[08:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:14] T231280: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280
[08:18:07] (PS3) Elukey: kafkatee::instance: add types to parameters [puppet] - https://gerrit.wikimedia.org/r/588086
[08:18:09] (PS4) Elukey: Enable TLS encryption between kafkatee instances and Kafka [puppet] - https://gerrit.wikimedia.org/r/588015
[08:22:21] (CR) Elukey: [C: +2] profile::kafkatee::webrequest::analytics: use ssl_array for Kafka Brokers [puppet] - https://gerrit.wikimedia.org/r/588085 (owner: Elukey)
[08:24:25] (CR) Elukey: kafkatee::instance: add types to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/588086 (owner: Elukey)
[09:26:03] (PS1) QChris: Add .gitreview [software/purged] - https://gerrit.wikimedia.org/r/588373
[09:26:05] (CR) QChris: [V: +2 C: +2] Add .gitreview [software/purged] - https://gerrit.wikimedia.org/r/588373 (owner: QChris)
[09:26:37] ema: ^ there's your new repo
[09:31:30] (CR) Ayounsi: "I didn't finish the review as being able to have the same VIP on multiple devices might make the rest of my review useless."
(11 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[09:38:33] (PS1) Dzahn: cloud/devtools: set profile::tlsproxy::envoy::capitalize_headers [puppet] - https://gerrit.wikimedia.org/r/588375
[09:45:08] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (Urbanecm) Upps, sorry, will do next time @Marostegui!
[09:45:52] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (Marostegui) No problem! Can you check my comment at: T245911#6051001 Thank you!
[09:47:41] !log mwscript createAndPromote.php --wiki=grwikimedia --force Gerakiw (T245911)
[09:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:48] T245911: Create a wiki for Wikimedia Community User Group Greece - https://phabricator.wikimedia.org/T245911
[09:52:19] !log Rename user account Gerakiw@grwikimedia to Geraki@grwikimedia (T245911)
[09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:57] Operations, ops-eqiad, ops-ulsfo, DC-Ops: Netbox report coherence_rack Icinga alert - https://phabricator.wikimedia.org/T250054 (faidon)
[09:55:59] Operations, DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (faidon)
[09:56:50] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (Marostegui) #cloud-services-team this is ready for the views creation on labsdb1009, 1010, 1011 and 1012. I have run this: `set session s...
[10:00:38] (CR) Arturo Borrero Gonzalez: [C: +2] sbuild: introduce module and use it in toolforge package builder [puppet] - https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) (owner: Arturo Borrero Gonzalez)
[10:07:34] Is this... normal? (I was trying to do `pwb.py claimit` https://www.irccloud.com/pastebin/tlmsyyY3/wikidata-log.txt
[10:08:17] (CR) Dzahn: [C: +2] "unbreak puppet on phabricator instances in cloud VPS devtools" [puppet] - https://gerrit.wikimedia.org/r/588375 (owner: Dzahn)
[10:08:46] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:21] can't confirm. 0 units failed
[10:11:02] ah, i can
[10:11:29] (PS1) Arturo Borrero Gonzalez: toolforge: legacy URLs: use HTTP 307/308 for the redirects [puppet] - https://gerrit.wikimedia.org/r/588380 (https://phabricator.wikimedia.org/T249843)
[10:12:01] !log mwmaint1002 - sudo systemctl status mediawiki_job_translationnotifications-mediawikiwiki.service
[10:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:23] Operations, Wikimedia-General-or-Unknown, Readers-Web-Backlog (Needs Product Owner Decisions), SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (ovasileva) >>! In T236241#6047510, @Aklapper wrote: > @ovasileva: Could you please check t...
[10:12:24] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:40] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: legacy URLs: use HTTP 307/308 for the redirects [puppet] - https://gerrit.wikimedia.org/r/588380 (https://phabricator.wikimedia.org/T249843) (owner: Arturo Borrero Gonzalez)
[10:19:14] !log Kill updateSpecialPages.php --only=Fewestrevisions for s8 in mwmaint1002, the vslow host is lagging and creating errors
[10:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:58] !log depooled wdqs1004 by request because of high lag
[10:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:23] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/588383 (https://phabricator.wikimedia.org/T128546)
[10:30:04] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T1030).
[10:31:35] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/588383 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:32:33] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/588383 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:33:38] (CR) Jbond: [C: +2] Make totp profile parameters optional [puppet] - https://gerrit.wikimedia.org/r/587714 (owner: Muehlenhoff)
[10:36:33] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:588383| Bumping portals to master (563985)]] (duration: 01m 00s)
[10:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:32] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:588383| Bumping portals to master (563985)]] (duration: 00m 58s)
[10:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:00] (CR) Jbond: [C: +1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/587726 (owner: Elukey)
[10:56:20] (CR) Jbond: [C: +2] "thanks, will merge" [puppet] - https://gerrit.wikimedia.org/r/587988 (owner: Hashar)
[10:57:53] (CR) Jbond: [C: +2] "thanks will merge" [puppet] - https://gerrit.wikimedia.org/r/587989 (owner: Hashar)
[10:58:05] (PS2) Jbond: admin: enhance test output for groups GID [puppet] - https://gerrit.wikimedia.org/r/587989 (owner: Hashar)
[10:58:38] (PS2) Jbond: admin: show gid in gid test error [puppet] - https://gerrit.wikimedia.org/r/587990 (owner: Hashar)
[10:59:53] (CR) Jbond: [C: +2] "thanks will merge" [puppet] - https://gerrit.wikimedia.org/r/587990 (owner: Hashar)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T1100).
[11:00:04] Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:08] Here :)
[11:02:42] P. S. I have only one patch which no needs mwdebug ;)
[11:11:11] Zoranzoki21: I can SWAT today!
[11:11:19] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: Zoranzoki21)
[11:12:13] (Merged) jenkins-bot: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: Zoranzoki21)
[11:13:27] Zoranzoki21: syncing
[11:14:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: efe2feb: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (T248860) (duration: 00m 58s)
[11:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:28] T248860: Disable indexing user (sub)pages and drafts on Serbian Wikipedia - https://phabricator.wikimedia.org/T248860
[11:15:26] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: efe2feb: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (T248860; take II) (duration: 00m 58s)
[11:15:29] Cool
[11:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:34] Zoranzoki21: should be all done :)
[11:15:59] Yes, thanks Urbanecm.
[11:16:03] yw
[11:19:41] (PS5) Alexandros Kosiaris: admin: Deduplicate defaults.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/581507
[11:19:43] (PS3) Alexandros Kosiaris: admin: deduplicate main helmfile.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/581656
[11:19:45] (PS3) Alexandros Kosiaris: admin/namespace: Deduplicate all helmfile templates [deployment-charts] - https://gerrit.wikimedia.org/r/581657
[11:19:47] (PS3) Alexandros Kosiaris: admin: Default to sensible values for deploUser, namespaceName [deployment-charts] - https://gerrit.wikimedia.org/r/581658
[11:19:49] (PS3) Alexandros Kosiaris: admin: Remove all override files [deployment-charts] - https://gerrit.wikimedia.org/r/581748
[11:29:29] (PS1) Jbond: admin: show uid in uid test error [puppet] - https://gerrit.wikimedia.org/r/588387
[11:30:49] (CR) Cparle: [C: +1] MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway)
[11:45:11] Operations, netops, User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (jbond) >>! In T238823#5681406, @akosiaris wrote: > Could be totally different but with @jijiki we 've seen this behavior elsewhere as well. The latest installment is T238789. Per that log...
[11:47:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:50:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:53:13] !log Deploy schema change on codfw master (lag will appear on codfw) - T250062
[11:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:19] T250062: ipb_parent_block_id_2 index on ipblocks table on s8 only - https://phabricator.wikimedia.org/T250062
[11:53:29] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[11:53:35] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[11:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:40] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[11:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:54] !log Deploy schema change on eqiad s8 hosts - T250062
[11:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:24] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[11:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:48] (CR) Alexandros Kosiaris: [C: +2] admin: Deduplicate defaults.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/581507 (owner: Alexandros Kosiaris)
[12:04:13] (Merged) jenkins-bot: admin: Deduplicate defaults.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/581507 (owner: Alexandros Kosiaris)
[12:12:05] (CR) Arturo Borrero Gonzalez: "Some comments. Mostly about datatypes, the [0] syntax and lookup()."
(17 comments) [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott)
[12:12:21] !log rolling upgrade to ats 8.0.7-rc0-1wm1 in eqsin and codfw
[12:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:21] (PS1) ArielGlenn: fix listing of input files for 7z recompression [dumps] - https://gerrit.wikimedia.org/r/588393 (https://phabricator.wikimedia.org/T250018)
[12:31:59] (PS5) Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939)
[12:39:04] Operations, Patch-For-Review, User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (jbond) >>! In T233939#6041881, @Volker_E wrote: > We're following WCAG 2.0 level AA color contrast ratios, so something like a placeholder text color needs to provide 4.5:1 contra...
[12:39:14] (CR) Jbond: "thanks see inline" (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)
[13:04:24] (PS1) Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335)
[13:04:43] (CR) jerkins-bot: [V: -1] Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[13:08:25] (CR) Andrew Bogott: "Hello all!"
[puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott)
[13:08:50] (PS2) Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335)
[13:15:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:18:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:25:36] (CR) jerkins-bot: [V: -1] Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[13:26:03] (PS1) Alexandros Kosiaris: Heavily amend Description: field [debs/helmfile] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/588404
[13:28:17] (PS3) Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335)
[13:30:21] (CR) Alexandros Kosiaris: [C: +2] Heavily amend Description: field [debs/helmfile] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/588404 (owner: Alexandros Kosiaris)
[13:31:35] (Merged) jenkins-bot: Heavily amend Description: field [debs/helmfile] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/588404 (owner: Alexandros Kosiaris)
[13:33:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:33:32] (PS1) Mholloway: MachineVision: Add MachineVisionWithholdImageList config (Beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/588407 (https://phabricator.wikimedia.org/T249939)
[13:34:47] (CR) jerkins-bot: [V: -1] MachineVision: Add MachineVisionWithholdImageList config (Beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/588407 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway)
[13:35:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:36:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:37:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:37:47] (PS2) Mholloway: MachineVision: Add MachineVisionWithholdImageList config (Beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/588407 (https://phabricator.wikimedia.org/T249939)
[13:39:54] (CR) Mholloway: [C: +2] MachineVision: Add MachineVisionWithholdImageList config (Beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/588407 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway)
[13:40:47] (Merged) jenkins-bot: MachineVision: Add MachineVisionWithholdImageList config (Beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/588407 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway)
[13:41:32] (CR) Ottomata: "COOL! Does this mean we can access this as .Values.puppet_ca_crt?" [puppet] - https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: Hnowlan)
[13:43:09] (CR) Ottomata: "Interesting! Does writing even work? We might want to allow people to write results somewhere, no? Perhaps to their own Hive DBs?" [puppet] - https://gerrit.wikimedia.org/r/588073 (owner: Elukey)
[13:48:29] (PS1) Jhedden: openstack: increase labweb memcached size [puppet] - https://gerrit.wikimedia.org/r/588411 (https://phabricator.wikimedia.org/T145703)
[13:51:55] (CR) Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/21877/labweb1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/588411 (https://phabricator.wikimedia.org/T145703) (owner: Jhedden)
[13:56:10] (CR) Elukey: [C: +2] "> Interesting! Does writing even work?
We might want to allow" [puppet] - 10https://gerrit.wikimedia.org/r/588073 (owner: 10Elukey) [13:58:39] (03PS2) 10ArielGlenn: fix listing of input files for 7z recompression [dumps] - 10https://gerrit.wikimedia.org/r/588393 (https://phabricator.wikimedia.org/T250018) [14:02:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:04:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:16:51] (03PS1) 10Alexandros Kosiaris: Revert "Ignore quilt dir .pc via .gitignore" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588413 [14:17:20] (03PS1) 10Alexandros Kosiaris: Fix debian/copyright [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588414 [14:17:22] (03PS1) 10Alexandros Kosiaris: Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 [14:17:24] (03CR) 10jerkins-bot: [V: 04-1] Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [14:18:24] (03CR) 10Alexandros Kosiaris: "recheck" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [14:18:42] kormat: 👀 [14:21:27] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Ignore quilt dir .pc via .gitignore" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588413 (owner: 10Alexandros Kosiaris) [14:21:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix 
debian/copyright [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588414 (owner: 10Alexandros Kosiaris) [14:21:46] (03CR) 10jerkins-bot: [V: 04-1] Fix debian/copyright [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588414 (owner: 10Alexandros Kosiaris) [14:21:51] cdanis: XDDD [14:29:41] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix debian/copyright [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588414 (owner: 10Alexandros Kosiaris) [14:43:15] (03PS2) 10Alexandros Kosiaris: Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 [14:43:17] (03PS1) 10Alexandros Kosiaris: Merge branch 'master' into buster-wikimedia [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588418 [14:43:53] (03CR) 10Arturo Borrero Gonzalez: Replace pykube with a custom API client (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) (owner: 10BryanDavis) [14:43:55] (03CR) 10jerkins-bot: [V: 04-1] Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [14:44:27] (03PS3) 10Jbond: profile::mail::jumpcloud: add new class to manage jumpcloud aliases [puppet] - 10https://gerrit.wikimedia.org/r/585501 (https://phabricator.wikimedia.org/T244792) [14:44:29] (03PS1) 10Jbond: profile::mail::mx: add type enforcment, lookups and move defaults [puppet] - 10https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) [14:44:31] (03PS1) 10Jbond: profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - 10https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) [14:45:09] (03CR) 10Andrew Bogott: [C: 03+1] "Harmless at worst" [puppet] - 10https://gerrit.wikimedia.org/r/588411 (https://phabricator.wikimedia.org/T145703) (owner: 10Jhedden) 
[14:48:18] (03CR) 10jerkins-bot: [V: 04-1] profile::mail::jumpcloud: add new class to manage jumpcloud aliases [puppet] - 10https://gerrit.wikimedia.org/r/585501 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [14:49:31] (03CR) 10jerkins-bot: [V: 04-1] profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - 10https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [14:50:30] (03PS4) 10Jbond: profile::mail::jumpcloud: add new class to manage jumpcloud aliases [puppet] - 10https://gerrit.wikimedia.org/r/585501 (https://phabricator.wikimedia.org/T244792) [14:51:08] (03PS2) 10Jbond: profile::mail::mx: add type enforcment, lookups and move defaults [puppet] - 10https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) [14:51:49] (03PS2) 10Jhedden: openstack: increase labweb memcached size [puppet] - 10https://gerrit.wikimedia.org/r/588411 (https://phabricator.wikimedia.org/T145703) [14:52:32] (03PS3) 10Alexandros Kosiaris: Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 [14:52:51] (03PS3) 10Jbond: profile::mail::mx: add type enforcment, lookups and move defaults [puppet] - 10https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) [14:53:03] (03PS2) 10Jbond: profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - 10https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) [14:53:12] (03CR) 10jerkins-bot: [V: 04-1] Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [14:56:37] (03CR) 10Jhedden: [C: 03+2] openstack: increase labweb memcached size [puppet] - 10https://gerrit.wikimedia.org/r/588411 (https://phabricator.wikimedia.org/T145703) (owner: 10Jhedden) [14:58:16] 10Operations, 10ops-ulsfo: update rack location of decom wmf5801 - https://phabricator.wikimedia.org/T249287 
(10RobH) [14:58:18] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: Netbox report coherence_rack Icinga alert - https://phabricator.wikimedia.org/T250054 (10RobH) [14:58:28] (03CR) 10jerkins-bot: [V: 04-1] profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - 10https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:01:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [15:02:37] (03PS3) 10Jbond: profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - 10https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) [15:09:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] designate: change api_base_uri to proper HA endpoint [puppet] - 10https://gerrit.wikimedia.org/r/588176 (owner: 10Andrew Bogott) [15:10:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:10:06] PROBLEM - Memory correctable errors -EDAC- on scb1001 is CRITICAL: 10 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1001&var-datasource=eqiad+prometheus/ops [15:10:44] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Performance-Team, 10Traffic: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (10Krinkle) [15:10:49] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Performance-Team, 10Traffic: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (10Krinkle) 
a:05Krinkle→03None [15:13:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:14:04] (03CR) 10BryanDavis: Replace pykube with a custom API client (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) (owner: 10BryanDavis) [15:16:13] (03PS4) 10Alexandros Kosiaris: Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 [15:16:16] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:16:56] (03CR) 10jerkins-bot: [V: 04-1] Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [15:17:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:20:44] (03PS1) 10Jbond: role::mail::mx: enable jumpcloud test domain [puppet] - 10https://gerrit.wikimedia.org/r/588425 [15:22:33] (03PS2) 10Jbond: role::mail::mx: enable jumpcloud test domain [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) [15:23:03] (03PS23) 10Andrew Bogott: Designate: use a list of designate hosts 
in hiera [puppet] - 10https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [15:23:06] (03PS1) 10Andrew Bogott: Designate: remove the coordination_host param [puppet] - 10https://gerrit.wikimedia.org/r/588426 (https://phabricator.wikimedia.org/T250087) [15:23:34] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:24:17] (03CR) 10Jbond: [C: 04-1] "I have self -1 as this needs review from kieth or someone else familiar with exim to correct my copy/paste" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:24:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:27:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:28:40] (03CR) 10Krinkle: apereo_cas: update templates login page (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [15:30:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:32:39] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:33:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:34:25] (03PS4) 10Jbond: profile::mail::mx: add type enforcment, lookups and move defaults [puppet] - 10https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) [15:34:53] 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10RobH) a:03Bstorm Please note this was NOT in ops-eqiad, and was likely being overlooked by onsites in eqiad due to that reason. (It also is not ass... [15:35:03] (03PS4) 10Jbond: profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - 10https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) [15:36:16] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:36:38] 10Operations, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10RobH) 05Open→03Resolved a:03RobH quick review of the scs devices on https://netbox.wikimedia.org/dcim/devices/?q=scs&status=active&mac_address=&has_primary_ip=&local_context_data=&virtual_chassis_member=&c... [15:36:40] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625 (10RobH) [15:38:35] 10Operations, 10DC-Ops: Wipe of spare/replacement disks - https://phabricator.wikimedia.org/T166368 (10RobH) This task is over a year old (should we resolve/reject it?) 
Please note that we no longer require all disks be wiped before decom (just reuse), as we physically destroy all disks now. I don't think ha... [15:39:38] (03PS6) 10Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) [15:39:56] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:40:25] 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris) >>! In T238823#6051460, @jbond wrote: >>>! In T238823#5681406, @akosiaris wrote: >> Could be totally different but with @jijiki we 've seen this behavior elsewhere as well. The... 
[15:40:49] (03CR) 10Jbond: apereo_cas: update templates login page (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [15:44:06] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:44:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:45:26] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:46:22] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:47:20] (03PS3) 10CRusnov: icinga: Add git local changes check [puppet] - 10https://gerrit.wikimedia.org/r/588049 [15:48:06] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:48:49] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) >>! In T187984#6048219, @Gryllida wrote: > @akosiaris Thank you for volunteering with this task. 
Are you still interested? How has the situation changed in the last... [15:49:20] (03CR) 10CRusnov: [C: 03+2] icinga: Add git local changes check [puppet] - 10https://gerrit.wikimedia.org/r/588049 (owner: 10CRusnov) [15:50:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:50:52] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:56:12] !log Deploy schema change on s4 codfw master - T250067 [15:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:18] T250067: user_newtalk has two indexes not renamed in s4 - https://phabricator.wikimedia.org/T250067 [15:56:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:56:40] (03CR) 10CDanis: [C: 03+1] "LGTM overall, some questions" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588049 (owner: 10CRusnov) [15:57:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:01:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:01:46] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:05:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:07:10] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:09:19] (03PS1) 10CDanis: WIP: add NIC saturation exporter [puppet] - 10https://gerrit.wikimedia.org/r/588431 [16:14:11] (03PS5) 10Alexandros Kosiaris: Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 [16:15:02] (03CR) 10jerkins-bot: [V: 04-1] Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [16:16:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:19:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:19:22] (03PS4) 10Andrew Bogott: designate: change api_base_uri to proper HA endpoint [puppet] - 10https://gerrit.wikimedia.org/r/588176 [16:21:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:21:44] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:25:34] (03CR) 10Andrew Bogott: [C: 03+2] designate: change api_base_uri to proper HA endpoint [puppet] - 10https://gerrit.wikimedia.org/r/588176 (owner: 10Andrew Bogott) [16:28:56] RECOVERY - Host ps1-c6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [16:29:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:29:50] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:30:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:30:44] RECOVERY - Host mw1335.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms 
[16:30:56] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [16:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:21] !log replacing msw-c6-eqiad [16:31:24] RECOVERY - Host db1134.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [16:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:41] !log Sample all inbound v6 traffic on cr2-eqsin [16:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:32:57] (03PS10) 10Andrew Bogott: designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [16:33:18] RECOVERY - Host mw1329.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [16:33:28] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:33:36] RECOVERY - Host mw1331.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [16:33:45] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[16:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:24] RECOVERY - Host mw1322.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [16:34:24] RECOVERY - Host mw1336.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [16:34:42] RECOVERY - Host mw1345.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [16:34:46] RECOVERY - Host mw1325.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [16:34:46] RECOVERY - Host mw1339.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [16:34:48] RECOVERY - Host mw1340.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [16:34:49] RECOVERY - Host mw1323.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [16:34:49] RECOVERY - Host mw1326.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [16:34:54] RECOVERY - Host mw1333.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [16:34:58] RECOVERY - Host mw1330.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [16:34:58] RECOVERY - Host mw1320.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [16:34:58] RECOVERY - Host mw1332.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [16:34:58] RECOVERY - Host mw1328.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [16:34:58] RECOVERY - Host mw1342.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [16:34:59] RECOVERY - Host mw1344.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [16:35:28] RECOVERY - Host mw1337.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [16:35:37] wooooo [16:35:42] thanks cmjohnson1 ! 
[16:35:42] RECOVERY - Host mw1343.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [16:35:44] RECOVERY - Host mw1321.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [16:35:48] RECOVERY - Host mw1319.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [16:35:48] RECOVERY - Host mw1327.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [16:36:05] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [16:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:12] RECOVERY - Host mw1324.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [16:36:12] RECOVERY - Host mw1334.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [16:36:12] RECOVERY - Host mw1348.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [16:36:12] RECOVERY - Host mw1346.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [16:36:12] RECOVERY - Host mw1338.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [16:36:13] RECOVERY - Host mw1347.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [16:36:19] !log sample before any other border-in terms in eqsin [16:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:16] RECOVERY - Host wdqs1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.64 ms [16:37:23] wut [16:37:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:38:38] RECOVERY - Host bast1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [16:39:19] (03CR) 10Andrew Bogott: [C: 03+2] designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [16:39:26] RECOVERY - Host mw1341.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [16:40:36] RECOVERY - Host db1121.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [16:42:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:45:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:46:13] !log sample before any other border-in terms in ulsfo [16:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:47:27] what are all the mgmt up alerts, does anyone know? 
[16:48:03] apergos: dead mgmt switch got replaced [16:48:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:48:19] aaahhh [16:48:22] excellent [16:48:28] tyvm! [16:48:39] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) a:03Jclark-ctr Hey @Jclark-ctr - per our conversation from last Thursday, can you work on fixing these following Netbox errors for eqiad when you go onsite this week? https... [16:49:53] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: Netbox report coherence_rack Icinga alert - https://phabricator.wikimedia.org/T250054 (10wiki_willy) a:03wiki_willy [16:50:04] (03PS1) 10Andrew Bogott: cloudservices: add a missing ) in a ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/588434 (https://phabricator.wikimedia.org/T249941) [16:50:17] 10Operations, 10ops-eqiad: msw-a2-eqiad missing from Netbox - https://phabricator.wikimedia.org/T249685 (10Cmjohnson) 05Open→03Resolved Very odd, I fixed this [16:50:27] !log sample before any other border-in terms in dfw [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:52] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:53:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:54:28] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:55:18] (03CR) 10jerkins-bot: [V: 04-1] cloudservices: add a missing ) in a ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/588434 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [16:55:44] 10Operations, 10ops-eqiad, 10netops: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10Cmjohnson) 05Open→03Resolved replaced the management switch today and updated netbox with new information, keeping the same name. changed the old one to msw-c6-eqiad-old and set status to decomm... [16:56:07] (03PS2) 10Andrew Bogott: cloudservices: add a missing ) in a ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/588434 (https://phabricator.wikimedia.org/T249941) [16:56:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:57:00] !log sample before any other border-in terms in esams [16:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:59:56] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [17:00:04] gehel and onimisionipe: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T1700). [17:00:37] 10Operations, 10ops-eqiad, 10DC-Ops: ganeti1011.mgmt is un-configured (was: Puppet resolves wrong IP for Icinga host config) - https://phabricator.wikimedia.org/T249314 (10Cmjohnson) 05Stalled→03Resolved the new management switch fixed the issue. [17:00:40] 10Operations, 10ops-eqiad, 10netops: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10Cmjohnson) [17:01:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:01:13] !log sample before any other border-in terms in eqiad [17:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:30] (03CR) 10Ayounsi: [C: 03+2] Sample all inbound traffic [homer/public] - 10https://gerrit.wikimedia.org/r/577316 (https://phabricator.wikimedia.org/T246618) (owner: 10Ayounsi) [17:01:38] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices: add a missing ) in a ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/588434 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:01:46] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:02:44] PROBLEM - Check systemd state on cescout1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:31] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10Cmjohnson) I swapped the DIMM A side to B side to see if error disappears or presents itself on the same slot or if it followed the DIMM [17:10:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:11:22] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10Cmjohnson) I cleared the log, this is a paste of the error Record: 17 Date/Time: 04/12/2020 06:30:30 Source: system Severity: Critical Description: Multi-bit memory errors detec... [17:11:43] (03PS1) 10Andrew Bogott: openstack: increase labweb memcached size for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/588439 (https://phabricator.wikimedia.org/T145703) [17:11:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:15:02] gehel: around? 
[17:15:12] (03PS2) 10Andrew Bogott: Designate: remove the coordination_host param [puppet] - 10https://gerrit.wikimedia.org/r/588426 (https://phabricator.wikimedia.org/T250087) [17:15:14] (03PS24) 10Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - 10https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [17:15:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:16:18] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:16:42] 10Operations, 10DC-Ops: Wipe of spare/replacement disks - https://phabricator.wikimedia.org/T166368 (10faidon) If I understand it correctly, this task is specifically about a box that was returned to the spare pool and then was reallocated for a new purpose but kept its old data. We should //definitely// wipe... 
[17:18:40] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:19:59] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10wiki_willy) [17:21:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [17:23:54] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:24:10] RECOVERY - DNS on ganeti1011.mgmt is OK: DNS OK: 0.011 seconds response time. 
ganeti1011.mgmt.eqiad.wmnet returns 10.65.5.106 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:26:06] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:26:50] 10Operations, 10DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (10wiki_willy) a:03wiki_willy [17:27:16] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:28:28] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:29:46] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:31:02] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:34:18] RhinosF1: public holiday here. I'll be back tomorrow [17:34:51] gehel: Sorted, was about to ask about one of them patches but saw another email saying they were both deployed [17:35:54] RhinosF1: good! Ping me if there is any issue with that federation ! 
[17:36:31] (03PS3) 10Mholloway: MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) [17:36:36] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:36:44] gehel: I’ll let you know if I hear anything [17:37:58] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:38:10] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:26] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:40] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:39:48] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:40:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:40:33] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10DannyS712) [17:41:01] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10DannyS712) [17:41:13] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10RhinosF1) Caused by a reboot to fix T246577 (again) [17:41:16] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:41:20] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:42:38] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10RhinosF1) 05Open→03Resolved [17:42:52] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:43:58] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10Jdforrester-WMF) >>! In T250103#6052721, @RhinosF1 wrote: > Caused by a reboot to fix T246577 (again) No? This task was *why* I restarted the instance. [17:44:41] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10RhinosF1) >>! In T250103#6052734, @Jdforrester-WMF wrote: >>>! In T250103#6052721, @RhinosF1 wrote: >> Caused by a reboot to fix T246577 (again) > > No? This t... 
[17:45:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:45:22] RECOVERY - Check systemd state on cescout1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:48:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:49:59] 10Operations, 10DC-Ops: Wipe of spare/replacement disks - https://phabricator.wikimedia.org/T166368 (10RobH) >>! In T166368#6052598, @faidon wrote: > If I understand it correctly, this task is specifically about a box that was returned to the spare pool and then was reallocated for a new purpose but kept its o... 
[17:50:06] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:51:50] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:52:40] 10Operations, 10DC-Ops: Wipe of spare/replacement disks - https://phabricator.wikimedia.org/T166368 (10wiki_willy) Hi @faidon - from our last conversation around this topic during the all-hands, if the onsite shredding was successful on March 20, then we could proceed with onsite shredding over drive wiping fo... [17:54:12] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:54:48] RECOVERY - Host restbase1025 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:55:56] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:56:30] jouncebot: next [17:56:30] In 0 hour(s) and 3 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T1800) [17:58:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:58:37] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10elukey) The host seems up, but the following is listed in dmesg: ` [Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 [Mon Apr 13 17:54:2... [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T1800). Please do the needful. [18:00:04] awight and niedzielski: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:31] I can deploy :-) [18:00:34] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:01:05] o/ hey all [18:01:52] nice awight :-) [18:01:58] mholloway has an undeployed config patch, "MachineVision: Add MachineVisionWithholdImageList config (Beta)" [18:02:22] ^ if anyone knows their screen name, would be nice to ping. 
[18:02:31] mdholloway: ^ [18:02:33] I'll just tiptoe around that for now [18:02:35] :-) [18:02:38] ty\ [18:02:43] awight: that one is beta-only, it doesn't affect production at all [18:02:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [18:03:04] awight: I'm actually trying to become a real SWAT deployer at some point. I have the access rights AFAIK but haven't done one yet. Would it be possible to sit in on our deployment today (or another day)? [18:03:10] Urbanecm: thanks for pointing it out! [18:03:24] niedzielski: Sure thing -- want to screen share perhaps? [18:03:34] awight: that'd be amazing! [18:03:43] awight: do you have a preferred service? [18:04:14] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:05:28] (03CR) 10Andrew Bogott: [C: 03+2] openstack: increase labweb memcached size for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/588439 (https://phabricator.wikimedia.org/T145703) (owner: 10Andrew Bogott) [18:05:29] awight: sorry about that, i haven't been doing manual deployments for beta-only config updates lately since they roll out automatically via jenkins job. 
[18:05:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:05:34] thanks niedzielski for the ping [18:06:36] mdholloway: you should pull beta-only patches onto the deployment host, to avoid further confusions :-) (no need to sync, just fetch them there) [18:07:27] Urbanecm: will do, thanks! [18:08:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:09:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:09:51] OK, should be fixed up onw [18:09:53] *now [18:09:55] (03CR) 10Andrew Bogott: [C: 03+2] Designate: remove the coordination_host param [puppet] - 10https://gerrit.wikimedia.org/r/588426 (https://phabricator.wikimedia.org/T250087) (owner: 10Andrew Bogott) [18:13:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:14:18] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:14:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:16:04] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:16:09] 10Operations, 10serviceops, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10Andrew) Pinging @MoritzMuehlenhoff, any objections to this? 
[18:16:15] 10Operations, 10WM-Bot: wm-bot doesn't set charset=utf-8, which breaks (amongst other things) emoji rendering - https://phabricator.wikimedia.org/T250104 (10CDanis) [18:22:37] 10Operations, 10DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (10wiki_willy) I'll take this as action item to discuss during our next staff meeting. I gave our Dell account rep a call today inquiring about when the latest firmware/bios... [18:23:37] (03PS25) 10Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - 10https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [18:27:16] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:29:02] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:29:34] (03CR) 10Andrew Bogott: [C: 03+2] Designate: use a list of designate hosts in hiera [puppet] - 10https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [18:30:26] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:32:10] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:33:05] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 4 others: service-runner apps (wikifeeds/cxserver at the least) 
running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Mholloway) Just verified that wikifeeds is now using named_le... [18:37:42] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:38:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:39:03] (03PS1) 10Andrew Bogott: Designate: fix ferm rules for designate_hosts list [puppet] - 10https://gerrit.wikimedia.org/r/588455 (https://phabricator.wikimedia.org/T249941) [18:39:58] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:40:02] (03PS2) 10Andrew Bogott: Designate: fix ferm rules for designate_hosts list [puppet] - 10https://gerrit.wikimedia.org/r/588455 (https://phabricator.wikimedia.org/T249941) [18:40:56] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:04] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:41:06] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:12] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:26] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:40] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:42:25] (03PS3) 10Andrew Bogott: Designate: fix ferm rules for designate_hosts list [puppet] - 10https://gerrit.wikimedia.org/r/588455 (https://phabricator.wikimedia.org/T249941) [18:43:00] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:43:08] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:43:38] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:45:10] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:45:34] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:32] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [18:46:32] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:47:02] PROBLEM - Check systemd state on db2078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:20] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:48] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:54] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:49:06] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:49:13] !log niedzielski@deploy1001 Synchronized php-1.35.0-wmf.27/extensions/TwoColConflict: SWAT: [[gerrit:588370|Fix double HTML escaping of "copytext" lines in the diff (T249986)]] (duration: 01m 01s) [18:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:19] T249986: Double escaping of HTML characters in the wikitext of "copy" lines - https://phabricator.wikimedia.org/T249986 [18:49:24] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10RLazarus) [18:50:10] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:50:10] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:50:37] (03PS4) 10Andrew Bogott: Designate: fix ferm rules for designate_hosts list [puppet] - 10https://gerrit.wikimedia.org/r/588455 (https://phabricator.wikimedia.org/T249941) [18:51:06] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:51:40] PROBLEM - Apache HTTP on mw1396 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:53:42] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:53:50] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:53:56] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:02] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: 
/{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:12] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:18] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:30] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:40] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:44] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:55:17] (03CR) 10Andrew Bogott: [C: 03+2] Designate: fix ferm rules for designate_hosts list [puppet] - 10https://gerrit.wikimedia.org/r/588455 
(https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [18:55:42] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:55:44] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:55:48] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:55:56] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:56:00] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:56:06] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:56:26] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:56:28] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:13] Operations, LDAP-Access-Requests, SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (wkandek) [18:58:19] awight: niedzielski: o/ still working? 
i see there's a minerva backport on the schedule that i haven't seen come through yet [18:59:18] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:22] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:59:54] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:00:46] mdholloway: yep, it's on its way. 
just a moment please [19:01:01] niedzielski: ok, no rush :) [19:01:18] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:01:26] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:01:36] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:02:02] EU SWAT is going over our window, probably just a few more minutes. 
[19:02:04] (PS1) Andrew Bogott: designate: fix a misplaced ) in a ferm rule [puppet] - https://gerrit.wikimedia.org/r/588462 (https://phabricator.wikimedia.org/T249941) [19:02:46] !log niedzielski@deploy1001 Synchronized php-1.35.0-wmf.27/skins/MinervaNeue: SWAT: [[gerrit:588405|Update the icon glyph (T249864)]] (duration: 01m 00s) [19:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:52] T249864: Section edit icon not displaying in Minerva skin - https://phabricator.wikimedia.org/T249864 [19:03:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:03:08] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:03:24] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:04:48] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:04:52] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:05:24] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:05:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:06:10] !log Morning SWAT done [19:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:34] mdholloway: thanks for waiting! all done now [19:06:44] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:06:50] niedzielski: great, thanks! [19:06:52] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:07:10] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:07:44] Thank you so much awight for the SWAT deployment lessons! [19:07:50] Very helpful!! 
[19:08:05] (CR) Andrew Bogott: [C: +2] designate: fix a misplaced ) in a ferm rule [puppet] - https://gerrit.wikimedia.org/r/588462 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [19:08:08] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:08:17] (CR) Mholloway: [C: +2] MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway) [19:08:32] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:08:36] niedzielski: Great work, thanks for letting me not get my hands dirty, so I could eat cookies :-) [19:08:59] ;D [19:09:00] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [19:09:00] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:09:58] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:10:11] (Merged) jenkins-bot: 
MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway) [19:10:50] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:11:16] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:26] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:32] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was rece [19:12:32] tech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:44] RECOVERY - Check systemd state on db2078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:50] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:13:00] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Add MachineVisionWithholdImageList config (T249939) (duration: 01m 03s) [19:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:06] T249939: Don't include images of humans in Special:SuggestedTags - https://phabricator.wikimedia.org/T249939 [19:14:14] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:14:20] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:14:36] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:14:52] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:16:06] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:17:50] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:19:36] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:20:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:21:28] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:23:13] (03PS1) 10Andrew Bogott: cloud puppetmaster: fix flatten() syntax when producing cert manager list [puppet] - 10https://gerrit.wikimedia.org/r/588465 (https://phabricator.wikimedia.org/T249941) [19:23:18] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:23:28] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:25:04] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 
handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:25:06] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:25:14] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:25:24] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:25:34] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:26:00] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:27:53] (03PS2) 10Andrew Bogott: cloud puppetmaster: fix flatten() syntax when producing cert manager list [puppet] - 10https://gerrit.wikimedia.org/r/588465 (https://phabricator.wikimedia.org/T249941) [19:27:55] (03PS1) 10Andrew Bogott: cloud-vps: add a dummy profile::backup::director_seed: changeme value for VMs [puppet] - 10https://gerrit.wikimedia.org/r/588469 [19:27:56] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:28:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:29:00] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:29:04] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:29:10] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:29:16] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:29:32] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:29:38] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:30:46] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:30:48] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:02] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:11] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: add a dummy profile::backup::director_seed: changeme value for VMs [puppet] - 10https://gerrit.wikimedia.org/r/588469 (owner: 10Andrew Bogott) [19:31:14] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:33:04] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:33:42] (03CR) 10Andrew Bogott: [C: 03+2] cloud puppetmaster: fix flatten() syntax when producing cert manager list [puppet] - 10https://gerrit.wikimedia.org/r/588465 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:33:56] PROBLEM - Ensure traffic_exporter binds on port 9322 and 
responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:34:38] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:35:16] 10Operations, 10MediaWiki-Cache, 10Traffic: Cache not being invalidated on new edits on Vaginal steaming - https://phabricator.wikimedia.org/T250108 (10AntiCompositeNumber) [19:36:18] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [19:36:18] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:37:31] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.27/extensions/MachineVision: Add support for WITHHOLD_ALL review state (T249939) (duration: 01m 23s) [19:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:37] T249939: Don't include images of humans in Special:SuggestedTags - https://phabricator.wikimedia.org/T249939 [19:38:34] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:38] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:39:58] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:20] !log ran extensions/MachineVision/maintenance/withholdImages.php on testcommonswiki [19:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:34] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:41:50] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:58] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:44:08] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:44:24] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:46:25] 10Operations, 10Services, 10Service-deployment-requests: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) 
[19:46:38] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 22420 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:46:51] 10Operations, 10Services, 10Service-deployment-requests: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) [19:49:58] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:50:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:51:02] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:02] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [19:51:02] 
s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:07] !log running extensions/MachineVision/maintenance/withholdImages.php on commonswiki [19:51:10] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [19:51:10] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:28] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:46] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [19:51:46] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:38] 10Operations, 10CommRel-Specialists-Support (Apr-Jun-2020): CommRel support for FY2019-2020 Q4 DC switchover - https://phabricator.wikimedia.org/T244808 (10Whatamidoing-WMF) We have a lot of requests right now, but if we get enough warning (which your team is always good about), then I think it's likely that s... 
[19:53:10] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:53:28] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:54:36] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:54:40] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:56:43] 10Operations, 10serviceops, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10bd808) Just so everyone is on the same page about this task and its parent ({T237773}), the desired long term solution ({T161859}) will //NOT// require LDAP for Wikitech.... 
[19:56:44] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:56:47] !log finished running extensions/MachineVision/maintenance/withholdImages.php on commonswiki (T249939) [19:56:48] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:56:52] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:54] T249939: Don't include images of humans in Special:SuggestedTags - https://phabricator.wikimedia.org/T249939 [19:56:58] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:57:04] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:57:18] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:58:22] (03PS1) 10Wolfgang Kandek: admin: add Wolfgang Kandek to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/588474 (https://phabricator.wikimedia.org/T249352) [19:58:24] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:02] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:52] RECOVERY - High average GET latency for mw requests on api_appserver in 
eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:00:04] halfak and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T2000). [20:00:54] (03CR) 10jerkins-bot: [V: 04-1] admin: add Wolfgang Kandek to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/588474 (https://phabricator.wikimedia.org/T249352) (owner: 10Wolfgang Kandek) [20:02:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:02:12] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:03:05] 10Operations, 10MediaWiki-Cache, 10Traffic: Cache not being invalidated on new edits on Vaginal steaming - https://phabricator.wikimedia.org/T250108 (10AntiCompositeNumber) After waiting a bit, I decided to try again. This is 50 minutes since the edit, and 20 minutes since the last attempt. 2. Navigate to h... 
[20:03:30] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:03:40] 10Operations, 10MediaWiki-Cache, 10Traffic: Cache not being invalidated on new edits on Vaginal steaming - https://phabricator.wikimedia.org/T250108 (10AntiCompositeNumber) [20:03:50] (03PS2) 10Ottomata: Remove now unused mediawiki/event-schemas repo [puppet] - 10https://gerrit.wikimedia.org/r/587255 (https://phabricator.wikimedia.org/T240985) [20:03:52] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:04:00] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:04:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:05:30] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:05:42] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:07:10] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:07:28] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:08:00] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:08:01] (03CR) 10Ottomata: [C: 03+2] Remove now unused mediawiki/event-schemas repo [puppet] - 10https://gerrit.wikimedia.org/r/587255 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [20:08:16] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:09:14] 
(03PS2) 10Wolfgang Kandek: admin: add Wolfgang Kandek to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/588474 (https://phabricator.wikimedia.org/T249352) [20:09:18] 10Operations, 10MediaWiki-Cache, 10Traffic: Cache not being invalidated on new edits on Vaginal steaming - https://phabricator.wikimedia.org/T250108 (10CDanis) [20:09:28] 10Operations, 10MediaWiki-Cache, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 4 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10CDanis) [20:10:10] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:10:10] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:02] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:11:02] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/ [20:11:02] itoring/recommendation_api [20:11:32] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:11:32] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:11:48] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received 
https://wikitech.wikimedia.org/ [20:11:48] itoring/recommendation_api [20:11:48] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:12:40] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:12:42] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:13:02] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:02] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:02] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:14] 
PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:20] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:34] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:13:36] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:14:04] (03CR) 10RLazarus: [C: 03+2] admin: add Wolfgang Kandek to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/588474 (https://phabricator.wikimedia.org/T249352) (owner: 10Wolfgang Kandek) [20:14:26] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:14:38] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:14:46] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:14:46] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:14:46] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:14:58] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:10] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:20] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:24] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:15:30] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:36] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:16:18] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:16:38] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:17:04] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:20:18] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:20:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:36] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:21:02] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:21:08] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:21:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:22:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:20] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:42] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:46] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:56] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:23:03] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10wkandek) [20:23:25] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10wkandek) [20:24:26] RECOVERY - recommendation_api endpoints health 
on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:22] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:25:42] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:31:16] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:32] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:58] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} 
(article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:32:38] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:33:02] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:16] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:54] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:35:08] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:12] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:24] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was 
received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:32] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:42] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:36:42] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:36:54] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:36:58] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:14] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:28] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:42] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Cmjohnson) These are on 1G racks. If you need 10G they will have to be moved. 
[20:39:56] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:40:47] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) >>! In T244506#6053265, @Cmjohnson wrote: > These are on 1G racks. If you need 10G they will have to be moved. Yep we'd need 10G, but regardless... [20:40:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:41:06] 10Operations, 10MediaWiki-Cache, 10Traffic: Cache not being invalidated on new edits on Vaginal steaming - https://phabricator.wikimedia.org/T250108 (10Peachey88) [20:41:08] (03PS2) 10Mholloway: MachineVision label blacklist updates, 2020-04-09 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587994 (https://phabricator.wikimedia.org/T249895) [20:41:27] (03CR) 10Hashar: "Thanks :] The error output was really confusing previously." 
[puppet] - 10https://gerrit.wikimedia.org/r/587990 (owner: 10Hashar) [20:42:19] (03PS3) 10Mholloway: MachineVision label blocklist updates, 2020-04-13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587994 (https://phabricator.wikimedia.org/T249895) [20:44:45] (03CR) 10Mholloway: [C: 03+2] MachineVision label blocklist updates, 2020-04-13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587994 (https://phabricator.wikimedia.org/T249895) (owner: 10Mholloway) [20:46:01] (03Merged) 10jenkins-bot: MachineVision label blocklist updates, 2020-04-13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587994 (https://phabricator.wikimedia.org/T249895) (owner: 10Mholloway) [20:46:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:49:00] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision blocklist update (T249895) (duration: 00m 59s) [20:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:08] T249895: CAT blocklist update, 2020-04-09 - https://phabricator.wikimedia.org/T249895 [20:49:48] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:53:24] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:53:53] 10Operations, 10Internet-Archive, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 
(Quiddity) [20:55:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:55:35] Operations, observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (Quiddity) [20:56:26] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:00:04] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T2100).
[21:02:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:03:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [21:04:28] Operations, serviceops, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (Joe) I don't know enough about php-ldap at the moment to have an opinion. In itself, adding a php extension to production is a big deal, but it's also easy to undo. Se... [21:07:13] Operations, serviceops, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (bd808) >>! In T237889#6053313, @Joe wrote: > Also: how temporary? Do you have a tentative timeline for transitioning wikitech to SUL? Geologically short, but maybe not s... [21:07:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:12:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:12:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:13:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:14:42] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:20:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:21:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:23:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:23:52] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:25:20] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:26:21] Operations, serviceops: Request for a in-memory caching data set for caching research - https://phabricator.wikimedia.org/T240503 (leila) [21:27:04] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:27:09] Operations, serviceops: Request for a in-memory caching data set for caching research - https://phabricator.wikimedia.org/T240503 (leila) I removed the Research tag as it refers to the work of the research team in WMF. However, if I can be of any help to the SRE team with this particular request, please... [21:32:22] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.27/extensions/MachineVision: Add script to apply blacklist to current labels (T249273) (duration: 00m 58s) [21:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:29] T249273: [S/M]Explore ways to apply CAT blacklist updates to previously tagged images - https://phabricator.wikimedia.org/T249273 [21:32:32] Operations, Wikimedia-Mailing-lists: Password reset for admin of wikiwomencamp mailing list - https://phabricator.wikimedia.org/T250035 (Quiddity) Open→Resolved a:Quiddity Done.
New pw sent to the address listed at https://lists.wikimedia.org/mailman/listinfo/wikiwomencamp :) [21:33:31] (CR) Andrew Bogott: [C: +1] clouddb.sql.erb: Add GRANTs file [puppet] - https://gerrit.wikimedia.org/r/588202 (owner: Marostegui) [21:34:40] !log ran extensions/MachineVision/maintenance/removeBlacklistedSuggestions.php on testcommonswiki [21:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:21] !log ran extensions/MachineVision/maintenance/removeBlacklistedSuggestions.php on commonswiki (T249273) [21:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:27] T249273: [S/M]Explore ways to apply CAT blacklist updates to previously tagged images - https://phabricator.wikimedia.org/T249273 [21:58:54] (PS1) CDanis: depool codfw, connectivity issues? [dns] - https://gerrit.wikimedia.org/r/588502 [22:04:44] Operations, Research, Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (leila) @BBlack can we close this task? [22:08:05] (CR) CDanis: [C: +2] depool codfw, connectivity issues? [dns] - https://gerrit.wikimedia.org/r/588502 (owner: CDanis) [22:08:42] !log depool codfw [22:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:00] Operations, Analytics, Research, Traffic: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (leila) @Nuria do you need our team's support in any way for this task? (I'm reviewing our tasks in Staged.)
[22:19:06] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 116.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [22:35:20] !log restart elasticsearch_6@production-search-psi-eqiad on elastic1052 [22:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:58] !log restart elasticsearch_6@production-search-psi-eqiad on elastic1052 for excessive old gc over last few hours [22:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:53] (PS1) CDanis: Revert "depool codfw, connectivity issues?" [dns] - https://gerrit.wikimedia.org/r/588505 [22:41:16] (CR) CDanis: [C: +2] Revert "depool codfw, connectivity issues?" [dns] - https://gerrit.wikimedia.org/r/588505 (owner: CDanis) [22:41:37] !log repool codfw [22:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:18] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:05:10] (CR) RLazarus: "Looks good!
Two structural suggestions (first and third comments) and I don't feel super strongly about either -- if you decided to keep b" (10 comments) [puppet] - https://gerrit.wikimedia.org/r/588431 (owner: CDanis) [23:06:29] (PS1) Mholloway: MachineVision: Withholding list additions [mediawiki-config] - https://gerrit.wikimedia.org/r/588508 (https://phabricator.wikimedia.org/T249939) [23:09:08] (CR) Mholloway: [C: +2] MachineVision: Withholding list additions [mediawiki-config] - https://gerrit.wikimedia.org/r/588508 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway) [23:10:19] (Merged) jenkins-bot: MachineVision: Withholding list additions [mediawiki-config] - https://gerrit.wikimedia.org/r/588508 (https://phabricator.wikimedia.org/T249939) (owner: Mholloway) [23:14:36] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision withholding list additions (T249939) (duration: 00m 59s) [23:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:44] T249939: Don't include images of humans in Special:SuggestedTags - https://phabricator.wikimedia.org/T249939 [23:24:31] !log re-ran extensions/MachineVision/maintenance/withholdImages.php on commonswiki [23:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:34] Operations, Research, Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (bmansurov) Open→Resolved Yes.