[00:00:09] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:27] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:41] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:25] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:47] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:48] (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624852 (https://phabricator.wikimedia.org/T262174) (owner: 10MarcoAurelio) [00:25:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [00:25:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [00:26:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15475 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [00:27:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 85507 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [00:41:53] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:33:23] PROBLEM - MediaWiki exceptions 
and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:33:53] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [01:40:15] PROBLEM - High average POST latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [01:42:13] RECOVERY - High average POST latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [01:43:33] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [01:44:57] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:34:59] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [04:35:51] (03PS1) 10Marostegui: db1087: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/625582 [04:40:01] (03CR) 10Marostegui: [C: 03+2] db1087: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/625582 (owner: 10Marostegui) [04:43:27] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [04:53:02] !log Deploy schema change on db1109 (eqiad wikidata master) - T256685 [04:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:09] T256685: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 [04:56:17] !log Compress InnoDB on s1 eqiad master - this will generate a few days of lag on s1 and labsdb for enwiki T254462 [04:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:22] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [05:08:08] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [05:10:58] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) @Papaul do you want 
me to attempt to get es2026 installed? [05:11:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087 after MCR schema change', diff saved to https://phabricator.wikimedia.org/P12501 and previous config saved to /var/cache/conftool/dbconfig/20200907-051157-marostegui.json [05:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:53] PROBLEM - HP RAID on ms-be2019 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [05:22:55] ACKNOWLEDGEMENT - HP RAID on ms-be2019 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T262182 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Run [05:22:55] aid_Information_Gathering [05:22:59] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10ops-monitoring-bot) [05:55:07] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "The issue you encounter has nothing to do with this class and all to do with the code in both" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [05:59:55] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Joe) The problem described here is - I think - the same as the one in the "simplelap" role: you're not including the `httpd::mpm` class that is designed to take care of things for you. 
[06:00:07] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Joe) a:03Joe [06:03:45] (03PS3) 10KartikMistry: Update cxserver to 2020-08-30-011854-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439) [06:07:30] (03PS1) 10Giuseppe Lavagetto: tendril::webserver: configure mpm [puppet] - 10https://gerrit.wikimedia.org/r/625583 (https://phabricator.wikimedia.org/T224589) [06:07:32] (03PS1) 10Giuseppe Lavagetto: mobileapps: add tls-enabled endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625584 (https://phabricator.wikimedia.org/T255876) [06:46:23] (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: add service_proxy to the general file [puppet] - 10https://gerrit.wikimedia.org/r/625585 [06:55:27] (03CR) 10Muehlenhoff: [C: 03+2] Yarn: Remove exception for OPTIONS [puppet] - 10https://gerrit.wikimedia.org/r/624712 (owner: 10Muehlenhoff) [06:56:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24971/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/625585 (owner: 10Giuseppe Lavagetto) [07:04:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10elukey) 05Resolved→03Open [07:06:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10elukey) Re-opening since some things need attention (mostly from the Analytics team): * these hosts don't have the flex-bay 2 disk hw raid II... 
[07:10:28] (03PS1) 10Elukey: install_server: set stretch for all new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/625588 [07:10:50] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) [07:12:36] (03CR) 10Elukey: [C: 03+2] install_server: set stretch for all new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/625588 (owner: 10Elukey) [07:12:46] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10hashar) [07:18:06] (03PS1) 10Muehlenhoff: Remove auth parameter for Graphite base class [puppet] - 10https://gerrit.wikimedia.org/r/625589 [07:18:18] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10jcrespo) Another thing I tracked is that dashboards don't seem to do proper filtering. Do dashboards have to be fully redone for 7? I am assuming dashboard definitions have... [07:20:13] jouncebot now [07:20:14] No deployments scheduled for the next 3 hour(s) and 9 minute(s) [07:20:17] jouncebot next [07:20:17] In 3 hour(s) and 9 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T1030) [07:20:19] 10Operations, 10Patch-For-Review: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10Aklapper) p:05Medium→03Low I have no idea what's "medium" priority about this. 
[07:20:55] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) @Cmjohnson I'd need these to be on Stretch, I have updated dhcp accordingly, will try to reimage :) [07:25:22] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24972/" [puppet] - 10https://gerrit.wikimedia.org/r/625589 (owner: 10Muehlenhoff) [07:29:17] (03CR) 10JMeybohm: [C: 03+1] "> Patch Set 7:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623464 (owner: 10Dzahn) [07:29:19] (03PS1) 10Muehlenhoff: Switch debmonitor to debmonitor1002 [dns] - 10https://gerrit.wikimedia.org/r/625591 (https://phabricator.wikimedia.org/T261489) [07:32:49] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy::envoy: add zotero as a backend [puppet] - 10https://gerrit.wikimedia.org/r/625592 (https://phabricator.wikimedia.org/T255868) [07:36:21] (03PS2) 10Muehlenhoff: Switch debmonitor to debmonitor1002 [dns] - 10https://gerrit.wikimedia.org/r/625591 (https://phabricator.wikimedia.org/T261489) [07:37:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24973/" [puppet] - 10https://gerrit.wikimedia.org/r/625592 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [07:37:57] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [07:39:53] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [07:56:48] (03PS1) 10JMeybohm: Revert "Revert "Convert proton to the new layout"" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/624854 [07:57:50] (03PS2) 10JMeybohm: Revert "Revert "Convert proton to the new layout"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/624854 (https://phabricator.wikimedia.org/T258572) [07:58:13] (03CR) 10JMeybohm: [C: 03+2] Revert "Revert "Convert proton to the new layout"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/624854 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [07:59:37] (03Merged) 10jenkins-bot: Revert "Revert "Convert proton to the new layout"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/624854 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [08:01:22] 10Operations, 10Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10hashar) [08:02:29] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [08:03:39] !log Compress InnoDB on s8 eqiad master (db1109) - T232446 [08:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:46] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [08:05:03] 10Operations, 10Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10hashar) [08:07:33] 10Operations, 10Gerrit, 10Phabricator, 10Traffic, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. 
- https://phabricator.wikimedia.org/T180655 (10hashar) [08:07:49] (03PS1) 10Giuseppe Lavagetto: citoid: make the zotero port configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/625595 (https://phabricator.wikimedia.org/T255868) [08:07:51] (03PS1) 10Giuseppe Lavagetto: citoid: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) [08:09:07] (03CR) 10jerkins-bot: [V: 04-1] citoid: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [08:10:08] !log jayme@deploy2001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [08:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:39] (03PS2) 10Giuseppe Lavagetto: citoid: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) [08:12:09] (03CR) 10JMeybohm: [C: 03+1] Add entries for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [08:16:31] (03PS1) 10Muehlenhoff: reboot-groups (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [08:17:38] (03CR) 10jerkins-bot: [V: 04-1] reboot-groups (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [08:18:26] (03CR) 10Filippo Giunchedi: Multiple instances of msearch_daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [08:18:50] !log jayme@deploy2001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . 
[08:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/625589 (owner: 10Muehlenhoff) [08:19:55] !log Upgrade and restart pc1010 [08:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:10] (03PS1) 10Kormat: admin: Fix obsolete name. [puppet] - 10https://gerrit.wikimedia.org/r/625598 [08:22:43] effie: for you (it re-adds the missing newline too :) ^ [08:29:35] !log jayme@deploy2001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [08:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:49] (03CR) 10Effie Mouzeli: [C: 03+1] admin: Fix obsolete name. [puppet] - 10https://gerrit.wikimedia.org/r/625598 (owner: 10Kormat) [08:31:16] (03CR) 10Kormat: [C: 03+2] admin: Fix obsolete name. [puppet] - 10https://gerrit.wikimedia.org/r/625598 (owner: 10Kormat) [08:35:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack Neutron config: remove the 'tld' variable [puppet] - 10https://gerrit.wikimedia.org/r/624763 (owner: 10Andrew Bogott) [08:37:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] designate: stop creating 'legacy' entries (that is, things under wmflabs) [puppet] - 10https://gerrit.wikimedia.org/r/620937 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [08:37:17] (03PS1) 10Giuseppe Lavagetto: citoid: add LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) [08:37:20] (03PS1) 10Giuseppe Lavagetto: citoid: promote https lvs to production status [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) [08:37:22] (03PS1) 10Giuseppe Lavagetto: service_proxy: switch citoid to TLS [puppet] - 10https://gerrit.wikimedia.org/r/625602 (https://phabricator.wikimedia.org/T255868) [08:37:24] (03PS1) 10Giuseppe Lavagetto: citoid: remove 
unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) [08:37:40] (03PS1) 10Filippo Giunchedi: codfw-prod: add ms-be2057 at object weight 100 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/625604 (https://phabricator.wikimedia.org/T261633) [08:38:49] (03CR) 10JMeybohm: [C: 03+2] sre.discovery: Refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [08:39:56] (03Merged) 10jenkins-bot: sre.discovery: Refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [08:40:16] (03PS2) 10Giuseppe Lavagetto: citoid: add TLS LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) [08:40:18] (03PS2) 10Giuseppe Lavagetto: citoid: promote https lvs to production status [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) [08:40:20] (03PS2) 10Giuseppe Lavagetto: service_proxy: switch citoid to TLS [puppet] - 10https://gerrit.wikimedia.org/r/625602 (https://phabricator.wikimedia.org/T255868) [08:40:22] (03PS2) 10Giuseppe Lavagetto: citoid: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) [08:40:55] 10Operations, 10SRE-tools, 10serviceops, 10Patch-For-Review: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) Merged the current version as is but the cookbook should be updated in short term with: ` [03.09.20 14:05]... [08:41:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "@ottomata can you merge and deploy this? I'm happy to assist with any issue you might encounter." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [08:43:31] (03CR) 10JMeybohm: [C: 03+1] citoid: make the zotero port configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/625595 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [08:44:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] citoid: make the zotero port configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/625595 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [08:45:29] (03Merged) 10jenkins-bot: citoid: make the zotero port configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/625595 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [08:47:07] (03CR) 10JMeybohm: [C: 04-1] citoid: use the service proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [08:49:17] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [08:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:38] (03CR) 10Muehlenhoff: [C: 03+2] Remove auth parameter for Graphite base class [puppet] - 10https://gerrit.wikimedia.org/r/625589 (owner: 10Muehlenhoff) [08:52:26] (03PS2) 10Effie Mouzeli: WIP php::admin: export additional opcache metrics [puppet] - 10https://gerrit.wikimedia.org/r/625224 [08:53:12] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[08:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1096.eqiad.wmnet'] ` The l... [08:59:58] (03PS1) 10Muehlenhoff: graphite: Remove obsolete LDAP template [puppet] - 10https://gerrit.wikimedia.org/r/625605 [09:02:33] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [09:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:35] (03CR) 10Giuseppe Lavagetto: citoid: use the service proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [09:03:44] (03PS3) 10Giuseppe Lavagetto: citoid: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) [09:04:16] (03CR) 10Volans: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/625591 (https://phabricator.wikimedia.org/T261489) (owner: 10Muehlenhoff) [09:05:38] 10Operations, 10serviceops: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [09:06:10] (03PS3) 10Effie Mouzeli: php::admin: export additional opcache metrics [puppet] - 10https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009) [09:06:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] citoid: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [09:06:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [09:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:17] (03CR) 10jerkins-bot: [V: 04-1] php::admin: export additional opcache metrics [puppet] - 10https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009) (owner: 10Effie Mouzeli) [09:07:17] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10fgiunchedi) @papaul looks like a BBU problem to me, can we order/install a new battery ? thanks! [09:08:15] (03Merged) 10jenkins-bot: citoid: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625596 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [09:09:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:01] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[09:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:59] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@c96b49e]: deploy wdqs-0.3.47 to wdqs1009 (test server) [09:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:15] (03PS25) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [09:12:32] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@c96b49e]: deploy wdqs-0.3.47 to wdqs1009 (test server) (duration: 00m 33s) [09:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:23] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:14:23] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:17:28] (03CR) 10JMeybohm: [C: 03+1] citoid: add TLS LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [09:19:42] (03CR) 10JMeybohm: [C: 04-1] citoid: promote https lvs to production status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [09:21:04] 10Operations, 10Citoid, 10Prod-Kubernetes, 10serviceops, 10Services (watching): Citoid automated monitoring times out due to Zotero v2 - https://phabricator.wikimedia.org/T211411 (10Mvolz) >>! In T211411#4913374, @mobrovac wrote: > There have been no timeouts recorded by the automatic check scripts since... 
[09:21:28] (03PS5) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [09:21:32] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10fgiunchedi) a:03Papaul [09:22:02] (03CR) 10JMeybohm: [C: 03+1] citoid: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [09:22:19] (03PS1) 10Elukey: install_server: change partition scheme for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/625608 (https://phabricator.wikimedia.org/T254892) [09:23:02] (03PS1) 10Muehlenhoff: graphite: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/625609 [09:23:18] (03CR) 10Hnowlan: api-portal: required extended configuration (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [09:23:20] (03CR) 10Elukey: [C: 03+2] install_server: change partition scheme for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/625608 (https://phabricator.wikimedia.org/T254892) (owner: 10Elukey) [09:23:48] (03PS26) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [09:24:22] (03CR) 10Muehlenhoff: [C: 03+2] Switch debmonitor to debmonitor1002 [dns] - 10https://gerrit.wikimedia.org/r/625591 (https://phabricator.wikimedia.org/T261489) (owner: 10Muehlenhoff) [09:24:34] (03CR) 10Elukey: [C: 03+1] graphite: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/625609 (owner: 10Muehlenhoff) [09:27:04] (03CR) 10Jbond: "> Patch Set 12: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [09:27:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "Prometheus part LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [09:30:53] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [09:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:14] (03CR) 10Filippo Giunchedi: [C: 03+1] graphite: Remove obsolete LDAP template [puppet] - 10https://gerrit.wikimedia.org/r/625605 (owner: 10Muehlenhoff) [09:35:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1096.eqiad.wmnet'] ` and were **ALL** successful. [09:36:22] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [09:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1096.eqiad.wmnet'] ` The l... 
[09:39:55] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [09:41:21] (03PS3) 10Effie Mouzeli: Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) [09:41:42] (03CR) 10jerkins-bot: [V: 04-1] Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [09:47:35] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1102.eqiad.wm... [10:00:34] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [10:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1096.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1096.eqiad.wmn... 
[10:06:44] jouncebot: next [10:06:44] In 0 hour(s) and 23 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T1030) [10:09:37] (03PS1) 10Elukey: install_server: rewrite partman recipe for hadoop worker nodes with gpus [puppet] - 10https://gerrit.wikimedia.org/r/625616 [10:10:37] (03CR) 10Elukey: [C: 03+2] install_server: rewrite partman recipe for hadoop worker nodes with gpus [puppet] - 10https://gerrit.wikimedia.org/r/625616 (owner: 10Elukey) [10:10:50] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10MGerlach) @kostajh @Joe some current estimates (@DED please correct/add): - once we have a fully running vers... [10:14:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1096.eqiad.wmnet'] ` The l... [10:25:15] 10Operations: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 (10MoritzMuehlenhoff) debmonitor.wikimedia.org is now served by debmonitor1002 running Buster and everything is working well. I'm keeping the old VMs throughout the week before tearing them down. [10:26:04] (03CR) 10Muehlenhoff: [C: 03+2] graphite: Remove obsolete LDAP template [puppet] - 10https://gerrit.wikimedia.org/r/625605 (owner: 10Muehlenhoff) [10:26:56] (03CR) 10MarcoAurelio: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/624855 (https://phabricator.wikimedia.org/T262181) (owner: 10MarcoAurelio) [10:28:43] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1102.eqiad.wmnet'] ` and were **ALL** successful. [10:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T1030). [10:31:31] (03PS1) 10Jbond: profile::java: add default undef to hiera [puppet] - 10https://gerrit.wikimedia.org/r/625618 [10:31:53] (03CR) 10jerkins-bot: [V: 04-1] profile::java: add default undef to hiera [puppet] - 10https://gerrit.wikimedia.org/r/625618 (owner: 10Jbond) [10:34:22] (03PS2) 10Jbond: profile::java: add default undef to hiera [puppet] - 10https://gerrit.wikimedia.org/r/625618 [10:35:06] (03CR) 10Jbond: [C: 03+2] profile::java: add default undef to hiera [puppet] - 10https://gerrit.wikimedia.org/r/625618 (owner: 10Jbond) [10:35:37] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [10:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:14] (03PS1) 10Giuseppe Lavagetto: mobileapps: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) [10:37:25] (03CR) 10jerkins-bot: [V: 04-1] mobileapps: use the service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [10:37:46] (03PS1) 10Jbond: profile::java: actualy drop undef from lookup function [puppet] - 10https://gerrit.wikimedia.org/r/625620 [10:37:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:37:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:56] (03CR) 10Giuseppe Lavagetto: "@Mholloway I will need you to check specifically:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [10:46:29] (03PS2) 10Giuseppe Lavagetto: mobileapps: use the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) [10:47:03] (03CR) 10Jbond: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24976/" [puppet] - 10https://gerrit.wikimedia.org/r/625620 (owner: 10Jbond) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T1100). [11:00:04] hauskatze: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] \o/ [11:00:16] hauskatze: I can deploy today [11:00:27] I'm here [11:00:44] (03CR) 10Urbanecm: [C: 03+2] [eswiki] Create an `abusefilter` user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624852 (https://phabricator.wikimedia.org/T262174) (owner: 10MarcoAurelio) [11:01:25] (03Merged) 10jenkins-bot: [eswiki] Create an `abusefilter` user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624852 (https://phabricator.wikimedia.org/T262174) (owner: 10MarcoAurelio) [11:01:45] !log Reboot pc1007 for upgrade [11:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:09] !log [urbanecm@mwmaint2001 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=hewiktionary wikilove # T262181 [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:15] T262181: Install wikilove in hewiktionary - https://phabricator.wikimedia.org/T262181 [11:02:44] hauskatze: eswiki abusefilter group is ready at mwdebug2001 [11:02:52] testing [11:03:16] (03PS3) 10Urbanecm: [hewiktionary] Enable wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624855 (https://phabricator.wikimedia.org/T262181) (owner: 10MarcoAurelio) [11:03:41] (03CR) 10Urbanecm: [C: 03+2] [hewiktionary] Enable wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624855 (https://phabricator.wikimedia.org/T262181) (owner: 10MarcoAurelio) [11:03:44] Urbanecm: LGTM, special:listgrouprights shows the new group and requested perms [11:03:53] hauskatze: cool, syncing then [11:04:16] I am also testing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/609748 which I've not listed yet; please don't close the window early if possible [11:04:25] We might sync. 
that one as well [11:04:25] (03Merged) 10jenkins-bot: [hewiktionary] Enable wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624855 (https://phabricator.wikimedia.org/T262181) (owner: 10MarcoAurelio) [11:04:58] sure hauskatze [11:05:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1096.eqiad.wmnet'] ` and were **ALL** successful. [11:06:09] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: 35224f43f1c461d42da5c963bb60d28fbe1992ee: [eswiki] Create an `abusefilter` user group (T262174; 1/2) (duration: 01m 20s) [11:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:15] T262174: Create an 'abusefilter' group for eswiki - https://phabricator.wikimedia.org/T262174 [11:07:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 35224f43f1c461d42da5c963bb60d28fbe1992ee: [eswiki] Create an `abusefilter` user group (T262174; 2/2) (duration: 00m 57s) [11:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:32] hauskatze: should be live :) [11:07:51] checking [11:07:59] hauskatze: wikilove is at mwdebug2001 now [11:08:22] going to he.wikt now [11:09:35] (03PS3) 10Giuseppe Lavagetto: mobileapps: use the reserved port for TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/623740 [11:09:45] option of "showing appreciation" appears in Special:Preferences on mwdebug2001 Urbanecm [11:09:51] I guess that's all I can test [11:10:28] oh well, and 'wikilove' tabs in the talk page as well [11:10:33] but not on self [11:10:58] okay, syncing then [11:12:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7b512d3a27c4c33949389cbbe7823cc534fbff9a: [hewiktionary] Enable wikilove (T262181) (duration: 00m 57s) [11:12:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:29] T262181: Install wikilove in hewiktionary - https://phabricator.wikimedia.org/T262181 [11:12:30] hauskatze: both done now :) [11:12:39] waiting for the help URL patch then [11:13:08] I've exported a file from meta to commons but didn't see the url [11:13:19] so my guess is this appears on not-yet-configured projects [11:13:55] https://commons.wikimedia.org/wiki/Special:ImportFile?clientUrl=https%3A%2F%2Fcs.wikipedia.org%2Fwiki%2FSoubor%3AWiki.png&importSource=FileExporter [11:13:57] positive? [11:14:28] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:14:41] uhohoh [11:15:13] oh, "in eqiad"? [11:15:45] _joe_: ^ should i stop with deploying? ^ [11:16:01] Urbanecm: looks like that indeed is the link [11:16:20] hauskatze: the interesting thing is that https://www.mediawiki.org/wiki/Extension:FileImporter/List_of_configured_wikis/cs.wikipedia doesn't exist [11:16:34] (while https://www.mediawiki.org/wiki/Extension:FileImporter/Data/cs.wikipedia does) [11:16:42] lul [11:16:55] Urbanecm: I am looking into it [11:17:11] thanks effie - please let me know when it is safe to deploy [11:17:12] Urbanecm: yeah, they moved the root page but not the subpages [11:17:23] so we'll end with lots of broken links [11:17:29] nah, -1ing that [11:17:33] I think we shouldn't merge then [11:18:01] but maybe what displays is wgFileImporterCommonsHelperBasePageName hauskatze ? 
[11:18:32] $wgFileImporterCommonsHelperBasePageName = 'Extension:FileImporter/Data/'; [11:18:39] right [11:19:12] Urbanecm: go ahead [11:19:18] thanks effie [11:21:14] Urbanecm: https://commons.wikimedia.org/wiki/Special:ImportFile?clientUrl=https%3A%2F%2Fes.wiktionary.org%2Fwiki%2FArchivo%3AGriego.jpg&importSource=FileExporter [11:21:35] that seems to be it [11:21:54] so I guess it's okay to sync. that one [11:22:07] we can always revert :) [11:22:15] (03PS3) 10Urbanecm: Update help URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609748 (https://phabricator.wikimedia.org/T256623) (owner: 10Awight) [11:22:16] you take the blame :P [11:22:22] (03CR) 10Urbanecm: [C: 03+2] Update help URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609748 (https://phabricator.wikimedia.org/T256623) (owner: 10Awight) [11:22:26] I'll list it in the deployments page [11:23:03] (03Merged) 10jenkins-bot: Update help URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609748 (https://phabricator.wikimedia.org/T256623) (owner: 10Awight) [11:23:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: use the reserved port for TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/623740 (owner: 10Giuseppe Lavagetto) [11:24:10] hauskatze: mwdebug2001 has it now :) [11:24:39] and the cswiki one still works [11:24:40] link updated in mw2001 [11:25:00] it looks like it appears only for totally-unconfigured wikis [11:25:04] indeed [11:25:05] like es.wiktionary [11:25:27] lgtm [11:26:08] syncing then [11:27:04] (03CR) 10Urbanecm: [C: 03+2] noc: Remove link to outdated blog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625357 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper) [11:27:21] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: ff9f1042529bd332effc0fcd18db70f417c2e939: Update help URL (T256623) (duration: 00m 56s) [11:27:25] hauskatze: done :) [11:27:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:27] T256623: Translations of the help page should not appear in the wiki configuration list - https://phabricator.wikimedia.org/T256623 [11:27:46] :D [11:27:51] (03Merged) 10jenkins-bot: noc: Remove link to outdated blog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625357 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper) [11:30:33] !log urbanecm@deploy1001 Synchronized docroot/noc/index.html: bbfe2ce61014f616d89bc0c21a380c15777b62e3: noc: Remove link to outdated blog (T259978) (duration: 00m 57s) [11:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:38] T259978: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 [11:31:03] 10Operations, 10Patch-For-Review: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10Urbanecm) @Aklapper I guess this can be considered resolved now? [11:31:33] 10Operations, 10serviceops: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [11:31:34] hauskatze: anything else? [11:31:45] Urbanecm: Thanks! hauskatze: Sorry to keep everyone wondering about the change. There was a conflict between page translation subpages and the per-wiki configuration subpages. [11:31:57] I think the deployed change is correct, still. [11:32:05] (03PS1) 10Marostegui: production-m2.sql: Add debmonitor grants. 
[puppet] - 10https://gerrit.wikimedia.org/r/625622 [11:32:19] awight: indeed, it looks correctly for both wikis we tested [11:33:09] (03CR) 10Marostegui: "These are the hosts resolution:" [puppet] - 10https://gerrit.wikimedia.org/r/625622 (owner: 10Marostegui) [11:33:34] Urbanecm: nothing else from me awight: happy to help [11:36:09] !log EU B&C done [11:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:13] hauskatze: closing then:) [11:36:15] lunch now [11:36:16] (03CR) 10Volans: [C: 04-1] "I think there is a typo in one IP" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625622 (owner: 10Marostegui) [11:36:20] bbl [11:37:24] (03PS2) 10Marostegui: production-m2.sql: Add debmonitor grants. [puppet] - 10https://gerrit.wikimedia.org/r/625622 [11:37:33] (03CR) 10Marostegui: production-m2.sql: Add debmonitor grants. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625622 (owner: 10Marostegui) [11:40:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:40:24] (03PS1) 10Jbond: java: add define to update the java trust store [puppet] - 10https://gerrit.wikimedia.org/r/625623 [11:40:26] (03PS1) 10Jbond: profile::java: add param to toggle puppet ca trust [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) [11:40:28] (03PS1) 10Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) [11:40:46] (03PS3) 10Marostegui: production-m2.sql: Add debmonitor grants. 
[puppet] - 10https://gerrit.wikimedia.org/r/625622 [11:41:25] (03PS1) 10Hnowlan: changeprop: migrate to using the new simplified helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625626 [11:42:23] (03PS2) 10Jbond: java: add define to update the java trust store [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) [11:43:37] (03PS4) 10Marostegui: production-m2.sql: Add debmonitor grants. [puppet] - 10https://gerrit.wikimedia.org/r/625622 [11:45:02] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for adding them and cleaning up the existing ones." [puppet] - 10https://gerrit.wikimedia.org/r/625622 (owner: 10Marostegui) [11:52:34] (03PS1) 10Vgutierrez: Release 1.5.3-1 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/625629 (https://phabricator.wikimedia.org/T261632) [11:53:01] (03CR) 10jerkins-bot: [V: 04-1] Release 1.5.3-1 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/625629 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [11:53:04] (03CR) 10Marostegui: [C: 03+2] production-m2.sql: Add debmonitor grants. 
[puppet] - 10https://gerrit.wikimedia.org/r/625622 (owner: 10Marostegui) [11:53:07] (03Abandoned) 10Vgutierrez: Release 1.3.1-4 [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/623614 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [11:53:59] (03PS2) 10Vgutierrez: Release 1.5.3-1 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/625629 (https://phabricator.wikimedia.org/T261632) [11:54:09] (03CR) 10jerkins-bot: [V: 04-1] Release 1.5.3-1 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/625629 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [11:54:44] (03PS3) 10Jbond: java: add define to update the java trust store [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) [11:55:09] (03PS2) 10Jbond: profile::java: add param to toggle puppet ca trust [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) [11:55:24] (03PS2) 10Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) [11:58:16] !log Reboot pc1008 for upgrade [11:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:26] 10Operations, 10Patch-For-Review: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10Aklapper) 05Open→03Resolved Yessss! Thanks a lot! 
[12:00:10] (03PS3) 10Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) [12:01:08] !log restart uwsgi on debmonitor1002 to test db reconnection [12:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:20] 10Operations, 10Traffic: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (10faidon) @BBlack @ayounsi I think this is done and can be resolved, right? Anything left here? [12:04:23] (03PS3) 10Jbond: profile::java: add param to toggle puppet ca trust [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) [12:05:33] (03PS4) 10Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) [12:07:13] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24977/" [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:08:40] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24978/" [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:12:42] (03PS1) 10Jbond: role:idp_test: add remove puppet CA from the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957) [12:12:42] (03PS1) 10Jbond: profile::java: add the puppet CA cert to the java truststore by default [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) [12:14:22] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24979/" [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:16:07] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: 
(Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) a:05RKemper→03wiki_willy Going through dmesg, I see: [ 5.799850] scsi 0:0:0:0: Direct-Access Gener... [12:16:11] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:17:11] ACKNOWLEDGEMENT - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www.sock Effie Mouzeli This is not a real alert, tests are causing it T262202 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=no [12:17:11] eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:17:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one typo inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:18:28] !log restarting elasticsearch on elastic2029 (high GC) [12:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:14] (03PS4) 10Jbond: profile::java: add param to toggle puppet ca trust [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) [12:24:56] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:25:04] (03PS5) 10Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) [12:25:11] (03PS2) 10Jbond: role:idp_test: add remove puppet CA from the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957) [12:25:18] (03PS2) 10Jbond: profile::java: 
add the puppet CA cert to the java truststore by default [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) [12:28:24] (03CR) 10Muehlenhoff: "Looks good, two comments/questions inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:28:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:29:35] !log Upgrade and reboot db2094 and db2095 (sanitarium hosts in codfw) [12:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:37:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:38:42] (03PS4) 10Jbond: java: add define to update the java trust store [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) [12:42:20] !log kormat@cumin1001 START - Cookbook sre.hosts.reboot-single [12:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:34] (03PS1) 10Hnowlan: changeprop-jobqueue: convert to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625632 [12:42:39] Hey... I'm trying to inspect a raw data blob in ES, but I'm having trouble running fetchText.php. [12:42:47] Fatal error: no version entry for `maintenance/fetchText.php`. [12:43:06] Is there some bit of config missing in MultiVersion, or am I doing it wrong? 
[12:43:27] !log kormat@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) [12:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:44:17] (03CR) 10Jbond: "Thanks updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:44:27] (03PS5) 10Jbond: profile::java: add param to toggle puppet ca trust [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) [12:44:34] (03PS6) 10Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) [12:44:42] (03PS3) 10Jbond: role:idp_test: add remove puppet CA from the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957) [12:44:50] (03PS3) 10Jbond: profile::java: add the puppet CA cert to the java truststore by default [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) [12:45:49] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:46:39] (03CR) 10JMeybohm: [C: 03+1] mobileapps: use the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [12:47:47] (03CR) 10Gehel: [C: 03+1] "LGTM. @Ryan, can you test and merge?" [puppet] - 10https://gerrit.wikimedia.org/r/624704 (https://phabricator.wikimedia.org/T258835) (owner: 10DCausse) [12:49:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:51:31] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:53:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: use the reserved port for TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/623740 (owner: 10Giuseppe Lavagetto) [12:54:27] (03Merged) 10jenkins-bot: mobileapps: use the reserved port for TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/623740 (owner: 10Giuseppe Lavagetto) [12:59:00] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[12:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1097.eqiad.wmnet', 'an-wor... [13:01:58] (03PS2) 10Reedy: Add security.wikimedia.org pointing to dyna [dns] - 10https://gerrit.wikimedia.org/r/612278 (https://phabricator.wikimedia.org/T257831) [13:03:34] (03CR) 10Kormat: [C: 03+1] Add security.wikimedia.org pointing to dyna [dns] - 10https://gerrit.wikimedia.org/r/612278 (https://phabricator.wikimedia.org/T257831) (owner: 10Reedy) [13:04:55] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [13:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:18] (03PS1) 10Muehlenhoff: Remove now obsolete cas-graphite and cas-icinga DNS entries [dns] - 10https://gerrit.wikimedia.org/r/625635 [13:05:36] (03PS2) 10Muehlenhoff: Remove now obsolete cas-graphite and cas-icinga DNS entries [dns] - 10https://gerrit.wikimedia.org/r/625635 [13:06:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: use the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [13:07:48] (03CR) 10Kormat: [C: 03+2] Add security.wikimedia.org pointing to dyna [dns] - 10https://gerrit.wikimedia.org/r/612278 (https://phabricator.wikimedia.org/T257831) (owner: 10Reedy) [13:14:29] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[13:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:55] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:15:51] PROBLEM - kubelet operational latencies on kubernetes1013 is CRITICAL: instance=kubernetes1013.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:16:07] PROBLEM - kubelet operational latencies on kubernetes1009 is CRITICAL: instance=kubernetes1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:16:45] hmm...looking ^^ [13:18:12] RECOVERY - kubelet operational latencies on kubernetes1013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:19:30] RECOVERY - kubelet operational latencies on kubernetes1009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:19:46] (03CR) 10Elukey: "Looks good to me, left a question about perms!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:20:18] (03CR) 10Gehel: "Minor comment inline, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:21:24] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:22:01] (03PS2) 10Giuseppe Lavagetto: mobileapps: add tls-enabled endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625584 (https://phabricator.wikimedia.org/T255876) [13:22:34] !log hashar@deploy1001 Started deploy [integration/docroot@11ab4a0]: (no justification provided) [13:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:44] !log hashar@deploy1001 Finished deploy [integration/docroot@11ab4a0]: (no justification provided) (duration: 00m 10s) [13:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:32] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:23:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:23:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:50] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:25:27] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:57] !log hashar@deploy1001 Started deploy [integration/docroot@e4e3af9]: Support published documents outside of the git checkout # T149924 [13:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:03] T149924: Clear /srv/.git on contint1001; move integration.wikimedia.org docroot to new location - https://phabricator.wikimedia.org/T149924 [13:28:03] !log hashar@deploy1001 Finished deploy [integration/docroot@e4e3af9]: Support published documents outside of the git checkout # T149924 (duration: 00m 05s) 
[13:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:28] (03CR) 10Elukey: [C: 03+1] profile::java: add param to toggle puppet ca trust [puppet] - 10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:29:41] (03CR) 10JMeybohm: [C: 04-1] mobileapps: add tls-enabled endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625584 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [13:29:48] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:32:28] (03PS4) 10Reedy: Add security.wikimedia.org microsite [puppet] - 10https://gerrit.wikimedia.org/r/612279 (https://phabricator.wikimedia.org/T257834) [13:32:47] (03CR) 10JMeybohm: [C: 03+1] changeprop: migrate to using the new simplified helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625626 (owner: 10Hnowlan) [13:33:29] (03CR) 10Elukey: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623361 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:33:40] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/625635 (owner: 10Muehlenhoff) [13:34:27] (03CR) 10Elukey: [C: 03+1] "If pcc is ok, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:35:38] (03CR) 10Elukey: [C: 03+1] puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 10https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:36:04] (03CR) 10JMeybohm: [C: 03+1] changeprop-jobqueue: convert to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625632 (owner: 10Hnowlan) [13:39:16] 10Operations, 10DNS, 10Security-Team, 10Traffic: Create dns for security.wikimedia.org - https://phabricator.wikimedia.org/T257831 (10Reedy) 05Open→03Resolved [13:39:43] (03CR) 10Giuseppe Lavagetto: mobileapps: add tls-enabled endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625584 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [13:40:39] (03PS3) 10Giuseppe Lavagetto: mobileapps: add tls-enabled endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625584 (https://phabricator.wikimedia.org/T255876) [13:40:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: add tls-enabled endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625584 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [13:44:47] <_joe_> !log restarting pybal in eqiad to pick up the new mobileapps TLS endpoint [13:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:52] (03PS5) 10Jbond: java: add define to update the java trust store [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) [13:48:34] <_joe_> !log restarting pybal in codfw to pick up the new mobileapps TLS endpoint [13:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:16] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10colewhite) @jcrespo Thanks for bringing this to 
our attention. The filters on that dashboard indicate they are broken because the filter pattern between logstash-* cannot be... [13:55:45] (03PS1) 10Giuseppe Lavagetto: mobileapps: enable monitoring on the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625643 (https://phabricator.wikimedia.org/T255876) [13:56:16] (03PS3) 10Vgutierrez: Release 1.5.3-1 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/625629 (https://phabricator.wikimedia.org/T261632) [13:56:31] (03CR) 10jerkins-bot: [V: 04-1] Release 1.5.3-1 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/625629 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [13:58:25] PROBLEM - gdnsd checkconf on authdns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:59:03] (03PS1) 10Hashar: doc: relocate published documents to /srv/doc [puppet] - 10https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) [13:59:34] * volans checking [14:00:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: enable monitoring on the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625643 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [14:00:48] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [14:00:50] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [14:01:01] _joe_: what's changing on mobileapps? [14:01:09] gdnsd is failing the config check [14:01:09] error: plugin_geoip: Invalid resource name 'disc-mobileapps' detected from zonefile lookup [14:01:13] error: Name 'mobileapps.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-mobileapps' [14:01:14] <_joe_> what do you mean?
[14:01:18] <_joe_> uhm [14:01:27] <_joe_> is that failing? [14:01:38] <_joe_> that's very very strange [14:01:38] is refusing to reload AFAIK [14:01:43] <_joe_> ok [14:01:45] I was looking at the icinga critical above [14:02:18] <_joe_> nothing should've changed [14:02:25] <_joe_> so let's see what is actually broken [14:03:21] <_joe_> this makes zero sense [14:03:39] <_joe_> I just added another LVS endpoint for it [14:03:44] I don't see the definition in discovery-geo-resources [14:03:54] disc-mobileapps that is [14:03:59] <_joe_> it's reappearing with another puppet run [14:04:05] <_joe_> this is just... [14:04:08] wut?!?!? [14:04:11] RECOVERY - gdnsd checkconf on authdns2001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:04:18] <_joe_> see 2001... [14:04:35] something wrong in puppet-land :) [14:04:42] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10elukey) >>! In T241791#6440442, @Gehel wrote: > Going through dmesg, I see: > > [ 5.799850] scsi 0:0:0:0: Direct-... [14:04:57] <_joe_> not in the code though [14:05:13] <_joe_> I can't explain what the heck was going on tbh [14:05:28] <_joe_> oh shit, no, I know [14:05:30] <_joe_> my bad [14:05:43] <_joe_> I made a mistake trying to be too cautious [14:06:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1097.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1099.eqiad.wmn...
[14:06:24] <_joe_> so I moved the new stuff into lvs-setup mode [14:06:38] <_joe_> and moved the discovery records there because I'll eventually remove the old stuff [14:12:37] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['relforge1003.e... [14:16:57] (03CR) 10Jbond: "updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [14:21:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [14:21:53] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1003.eqiad.wmnet', 'relforge1004.eqiad.wmnet'] ` Of whi... 
[14:23:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [14:23:27] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [14:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:29] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:39] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:46] buuuu [14:27:47] (03PS1) 10Elukey: Fix relforge100[3,4] definitions in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/625646 (https://phabricator.wikimedia.org/T241791) [14:33:05] (03PS1) 10Marostegui: instances.yaml: Remove db1133 [puppet] - 10https://gerrit.wikimedia.org/r/625648 (https://phabricator.wikimedia.org/T253217) [14:33:17] kormat: ^ [14:33:33] (03CR) 10Kormat: [C: 03+1] instances.yaml: Remove db1133 [puppet] - 10https://gerrit.wikimedia.org/r/625648 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [14:33:45] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1133 [puppet] - 10https://gerrit.wikimedia.org/r/625648 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [14:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1133 from dbctl T253217', diff saved to https://phabricator.wikimedia.org/P12504 and previous config saved to /var/cache/conftool/dbconfig/20200907-143507-marostegui.json [14:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:15] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [14:35:59] kormat: ^ fixed, thanks for the heads up [14:36:13] np :) [14:37:43] (03CR) 10Elukey: [C: 03+2] Fix relforge100[3,4] definitions in site.pp [puppet] - 
10https://gerrit.wikimedia.org/r/625646 (https://phabricator.wikimedia.org/T241791) (owner: 10Elukey) [14:38:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:38:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:17] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) Hello John, Thank you for the AHS. However, as per the AHS we see no hardware errors and the log event page also seems to be empty. Hence, request you to assist us with the sc... [14:44:02] (03PS1) 10Hashar: doc: stop backup for old doc directory [puppet] - 10https://gerrit.wikimedia.org/r/625649 (https://phabricator.wikimedia.org/T149924) [14:44:05] (03PS1) 10Hashar: doc: remove legacy doc directory [puppet] - 10https://gerrit.wikimedia.org/r/625650 (https://phabricator.wikimedia.org/T149924) [14:47:12] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10elukey) Current layout: ` elukey@relforge1003:~$ df -h Filesystem Size Used Avail Use% Mounted on udev... [14:47:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:06] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10elukey) Set both hosts to "Staged" in netbox. 
[14:53:26] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10elukey) [14:53:39] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10elukey) [14:54:11] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10elukey) @RKemper @Gehel does the layout works for you? If so I think this task is done :) [14:55:56] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove now obsolete cas-graphite and cas-icinga DNS entries [dns] - 10https://gerrit.wikimedia.org/r/625635 (owner: 10Muehlenhoff) [15:01:31] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) @Marostegui go for it [15:01:58] (03PS1) 10Muehlenhoff: Retire the HTTP listener for debmonitor (along with ferm rules) [puppet] - 10https://gerrit.wikimedia.org/r/625658 [15:02:26] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:02:27] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:52] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es2026.codfw.wmnet... 
[15:03:07] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:03:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:11] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12506 and previous config saved to /var/cache/conftool/dbconfig/20200907-150310-kormat.json [15:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:19] !log rebooting poolcounter1004/1005 [15:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:03:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:53] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10Vgutierrez) [15:06:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:05] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) >>! In T241791#6440734, @elukey wrote: > @RKemper @Gehel does the layout works for you? If so I think this task is... 
[15:08:57] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) a:05wiki_willy→03RKemper [15:09:02] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P12507 and previous config saved to /var/cache/conftool/dbconfig/20200907-150901-kormat.json [15:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:51] (03PS1) 10Vgutierrez: 1.7-4: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) [15:12:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:22] (03CR) 10jerkins-bot: [V: 04-1] 1.7-4: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [15:14:20] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) @Papaul can you check the cable/switch/interface? 
` PXE-E61: Media test failure, check cable ` [15:14:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:17:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:19] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12508 and previous config saved to /var/cache/conftool/dbconfig/20200907-151718-kormat.json [15:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:52] (03CR) 10Volans: [C: 03+1] "LGTM, will the ferm rule be automatically deleted?" [puppet] - 10https://gerrit.wikimedia.org/r/625658 (owner: 10Muehlenhoff) [15:19:11] (03PS1) 10Filippo Giunchedi: Add Alertmanager client [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/625660 (https://phabricator.wikimedia.org/T258948) [15:19:13] (03PS1) 10Filippo Giunchedi: Add Icinga AM client [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) [15:21:18] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P12509 and previous config saved to /var/cache/conftool/dbconfig/20200907-152117-kormat.json [15:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:42] (03CR) 10Muehlenhoff: "> LGTM, will the ferm rule be automatically deleted?" 
[puppet] - 10https://gerrit.wikimedia.org/r/625658 (owner: 10Muehlenhoff) [15:25:19] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) @Marostegui holiday today in the U.S so not at the DC. It is not a cable problem ` papaul@asw-a-codfw# run show interfac... [15:26:17] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) @Papaul sure, no need to get it done today - you shouldn't be checking phab even! :-) Enjoy your day off! :) [15:26:56] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [15:32:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:32:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12510 and previous config saved to /var/cache/conftool/dbconfig/20200907-153206-kormat.json [15:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:51] (03PS1) 10Marostegui: install_server: Change es2026 MAC [puppet] - 10https://gerrit.wikimedia.org/r/625706 (https://phabricator.wikimedia.org/T260373) [15:36:06] (03PS1) 10Giuseppe Lavagetto: 
service_proxy: switch mobileapps to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625707 (https://phabricator.wikimedia.org/T255876) [15:36:52] (03CR) 10Marostegui: [C: 03+2] install_server: Change es2026 MAC [puppet] - 10https://gerrit.wikimedia.org/r/625706 (https://phabricator.wikimedia.org/T260373) (owner: 10Marostegui) [15:38:58] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P12511 and previous config saved to /var/cache/conftool/dbconfig/20200907-153857-kormat.json [15:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:38] (03PS1) 10Jbond: pki: add vhosts for pki and ocsp which will proxy to the backend cfssl [puppet] - 10https://gerrit.wikimedia.org/r/625708 (https://phabricator.wikimedia.org/T259117) [15:45:27] (03PS2) 10Jbond: pki: add vhosts for pki and ocsp which will proxy to the backend cfssl [puppet] - 10https://gerrit.wikimedia.org/r/625708 (https://phabricator.wikimedia.org/T259117) [15:48:45] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1003/24984/ only restbase will change." [puppet] - 10https://gerrit.wikimedia.org/r/625707 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [15:52:28] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) No luck there @Papaul, things I have noticed: - the mac address on the DHCP file was pointing to the 10G interface.... [15:57:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1110.eqiad.wmnet', 'an-wor... 
[15:57:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service_proxy: switch mobileapps to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625707 (https://phabricator.wikimedia.org/T255876) (owner: 10Giuseppe Lavagetto) [16:03:17] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2026.codfw.wmnet'] ` Of which those **FAILED**: ` ['es2026.codfw.wmne... [16:03:58] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) @Cmjohnson I think that the two SSDs in the flex bay are not configured with hardware RAID1 (like all the other hadoop wo... [16:08:05] (03PS1) 10Effie Mouzeli: push-notifications: add proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/625709 (https://phabricator.wikimedia.org/T256973) [16:10:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [16:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:54] (03PS3) 10Jbond: pki: add vhosts for pki and ocsp which will proxy to the backend cfssl [puppet] - 10https://gerrit.wikimedia.org/r/625708 (https://phabricator.wikimedia.org/T259117) [16:21:59] (03CR) 10jerkins-bot: [V: 04-1] pki: add vhosts for pki and ocsp which will proxy to the backend cfssl [puppet] - 10https://gerrit.wikimedia.org/r/625708 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [16:27:58] (03PS4) 10Jbond: pki: add vhosts for pki and ocsp which will proxy to the backend cfssl [puppet] - 10https://gerrit.wikimedia.org/r/625708 (https://phabricator.wikimedia.org/T259117) 
[16:32:45] (03PS1) 10Vgutierrez: Fix includes to build against Varnish 6 [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/625713 (https://phabricator.wikimedia.org/T261632) [16:45:19] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [16:46:17] 10Operations, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move mobileapps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi AFAICS mobileapps now is logging fully to k8s... [16:48:32] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Wikimedia-Logstash, and 3 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10fgiunchedi) Looks like mobileapps has been fixed in T219924, though I'm not seeing any `logging` configuration for pr... [17:00:04] ryankemper: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T1700). [17:05:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1110.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1101.eqiad.wmn... [17:08:55] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:09:03] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:11:37] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:14:05] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:14:24] ^ me [17:14:29] and *sigh* [17:32:44] (03CR) 10Cicalese: [C: 03+1] api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [17:53:37] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:59:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:22] effie: is it fine to deploy? [18:01:32] yes, I downtimed it [18:01:44] sorry for the noise, I will try to work something tomorrow [18:01:48] to avoid those alerts [18:02:18] thanks effie - I'm just checking :-) [18:03:05] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:04:00] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 56s) [18:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:45] * Urbanecm is done with deploying [18:15:06] (03CR) 10Hashar: "git buildpackage complains due to lack of tags:" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [19:25:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:27:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T2000). [21:00:04] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T2100). [21:12:09] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10bcampbell) Hey folks. 
Unfortunately, we discovered that the SRV/DNS solution is blocking our abili... [21:12:19] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10bcampbell) 05Resolved→03Open [21:19:08] !log reedy@deploy1001 Synchronized private/PrivateSettings.php: Remove old mitigation (duration: 00m 55s) [21:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:56] nray: Aha, didn't see you there :) [22:43:12] hi nray ! [22:43:12] Reedy: o/ hello! [22:43:21] hi Platonides :) [22:48:38] Ugh, typical [22:49:42] what's up? [22:50:00] PM [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200907T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:35:10] !log Deployed patch for T262213 [23:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log