[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T0000). Please do the needful.
[00:41:44] (PS4) Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761)
[00:50:15] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:14:11] (CR) Tim Starling: [C: +2] FP: Improve EntityLinkTargetEntityIdLookup exception message [extensions/Wikibase] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604524 (https://phabricator.wikimedia.org/T255078) (owner: Addshore)
[01:24:39] (PS5) Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761)
[01:25:41] Latest train blocker - T255088
[01:25:42] T255088: Internal error on meta's Special:RecentChanges - https://phabricator.wikimedia.org/T255088
[01:33:28] (Merged) jenkins-bot: FP: Improve EntityLinkTargetEntityIdLookup exception message [extensions/Wikibase] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604524 (https://phabricator.wikimedia.org/T255078) (owner: Addshore)
[02:01:59] (CR) BryanDavis: "> My only question about this is re: the commit msg.
It says it makes" [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis)
[02:15:29] (PS2) Reedy: Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301)
[02:34:00] (PS6) Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761)
[02:43:40] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/Wikibase/lib/includes/Store/EntityLinkTargetEntityIdLookup.php: investigate UBN T255078 (duration: 01m 07s)
[02:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:43] T255078: RuntimeException when trying to view history of [[c:Template talk:Wikidata Infobox]] - https://phabricator.wikimedia.org/T255078
[02:45:31] (PS7) Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761)
[02:47:10] (PS3) Dave Pifke: Add passwords for Cloud VPS XHGui database [labs/private] - https://gerrit.wikimedia.org/r/604498 (https://phabricator.wikimedia.org/T180761)
[03:13:58] !log removing WDQS-Streaming-Updater-POC metrics on graphite1004 - T255044
[03:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:14:02] T255044: Many new metrics in Graphite for WDQS-Streaming-Updater-POC - https://phabricator.wikimedia.org/T255044
[04:25:17] can someone get the stack trace for `571489e6-9647-4fb9-a3c1-a5d53d6ea40b` please?
[04:45:13] (PS1) Marostegui: db1127: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/604550 (https://phabricator.wikimedia.org/T253217)
[04:46:43] (CR) Marostegui: [C: +2] db1127: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/604550 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[04:47:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1084 and slowly repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11462 and previous config saved to /var/cache/conftool/dbconfig/20200611-044725-marostegui.json
[04:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:30] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[04:48:21] (CR) Marostegui: "Both are needed I believe. If the event scheduler isn't running and the events are installed, we would have the same issue as if the event" [puppet] - https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: Marostegui)
[04:50:52] !log Deploy schema change on testwiki - T254371
[04:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:55] T254371: CentralNotice: Update DB schema on Meta for new features - https://phabricator.wikimedia.org/T254371
[04:52:18] marostegui: hi! thanks so much :)
[04:54:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078', diff saved to https://phabricator.wikimedia.org/P11463 and previous config saved to /var/cache/conftool/dbconfig/20200611-045426-marostegui.json
[04:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:40] AndyRussG: can you confirm that the tables look good?
(I pasted them on the task)
[05:02:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1078', diff saved to https://phabricator.wikimedia.org/P11464 and previous config saved to /var/cache/conftool/dbconfig/20200611-050200-marostegui.json
[05:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:04] Internal Error on Special:RC
[05:02:23] https://meta.wikimedia.org/wiki/Special:RecentChanges has an Internal Error oO
[05:02:41] Bsadowski1: works for me
[05:02:57] its working for me now, but its a known issue T255088
[05:02:58] T255088: Internal error on meta's Special:RecentChanges - https://phabricator.wikimedia.org/T255088
[05:03:17] Can someone check the stack trace for `571489e6-9647-4fb9-a3c1-a5d53d6ea40b` please? It may be related
[05:04:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11465 and previous config saved to /var/cache/conftool/dbconfig/20200611-050446-marostegui.json
[05:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:50] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[05:06:02] marostegui: one sec :)
[05:06:20] AndyRussG: no rush! :)
[05:08:59] (PS1) Marostegui: mariadb: Reimage dbstore1003 with Buster [puppet] - https://gerrit.wikimedia.org/r/604551 (https://phabricator.wikimedia.org/T254870)
[05:11:16] marostegui: seems fine... I guess most important is to look at the changed columns?
[05:11:37] AndyRussG: yeah, essentially that the new additions are there and I've not missed anything :)
[05:11:43] marostegui: I wasn't aware until now of the varchar/varbinary difference here... Should we change things in this extension
[05:11:55] K yeah looks fine!
[05:12:04] excellent, I will proceed with metawiki during the day
[05:12:09] thanks for checking
[05:13:15] K likewise thanks much!
[05:15:05] marostegui: also I see the blob vs text difference, assuming that's correct too
[05:15:41] marostegui: I will be afk (sleep) from approximate 30 minutes from now to 8hrs 30 minutes from now
[05:15:58] in case you want someone to poke at things on production or anything
[05:16:16] AndyRussG: No worries, I should have metawiki ready for when you wake up :)
[05:16:43] AndyRussG: it is blob because our tables are binary, so that's why you see blob instead of text :)
[05:21:14] marostegui: ok cool! so nothing we need to change in the code, correct?
[05:21:16] thanks again!
[05:21:32] AndyRussG: nope
[05:25:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11466 and previous config saved to /var/cache/conftool/dbconfig/20200611-052535-marostegui.json
[05:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:39] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[05:55:14] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:55:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11467 and previous config saved to /var/cache/conftool/dbconfig/20200611-055536-marostegui.json
[05:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:40] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[05:57:45] (PS7) Elukey: memcached: allow more tunables to avoid implicit settings [puppet] - https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391)
[05:59:15] disabling puppet on all hosts with memcached just in case --^
[06:00:31] (CR) Elukey: [C: +2] memcached: allow more tunables to avoid implicit settings [puppet] -
https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) (owner: Elukey)
[06:29:55] completed the roll out, no-op everywhere
[06:37:47] !log make asw2-esams interfaces Homer like - T250429
[06:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:26] T250429: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429
[07:03:24] (CR) Kormat: [C: +1] mariadb: Reimage dbstore1003 with Buster [puppet] - https://gerrit.wikimedia.org/r/604551 (https://phabricator.wikimedia.org/T254870) (owner: Marostegui)
[07:05:34] (CR) Marostegui: [C: +2] mariadb: Reimage dbstore1003 with Buster [puppet] - https://gerrit.wikimedia.org/r/604551 (https://phabricator.wikimedia.org/T254870) (owner: Marostegui)
[07:07:29] (PS1) Marostegui: dbstore1003: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/604585
[07:07:52] !log Stop MySQL on dbstore1003 for reimage - T254870
[07:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:56] T254870: Upgrade analytics dbstore databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254870
[07:08:09] (CR) Marostegui: [C: +2] dbstore1003: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/604585 (owner: Marostegui)
[07:20:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:21:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:24:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:26] (PS1) KartikMistry: IssueTrackingTool: Fix js error in getCurrentNodeId method [extensions/ContentTranslation] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604587 (https://phabricator.wikimedia.org/T254965)
[07:33:16] (PS2) Kormat: install_server: Better error reporting for reuse-parts [puppet] - https://gerrit.wikimedia.org/r/604413 (https://phabricator.wikimedia.org/T254982)
[07:33:26] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Vasanthi Hargyono - https://phabricator.wikimedia.org/T254961 (Dzahn) Resolved→Open When people are added to LDAP groups they also need to be added to the admin module in puppet.
[07:34:18] !log upgrading remaining app servers in eqiad to PHP 7.2.31
[07:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:36] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[07:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:55] (PS1) Marostegui: dbstore1003: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/604591
[07:49:46] (CR) Marostegui: [C: +2] dbstore1003: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/604591 (owner: Marostegui)
[07:51:06] (CR) Dzahn: [V: +2 C: +2] "beta cherry picked" [labs/private] - https://gerrit.wikimedia.org/r/604498 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[07:54:16] (CR) Dzahn: "Thanks! I merged the beta-cherry-picked password change in labs/private. Ping me to add the real private passwords into prod/private. More" [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[07:54:27] (PS11) Ayounsi: Netbox driven switch interfaces configuration [homer/public] - https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429)
[07:54:29] (PS3) Ayounsi: Netbox driven routers disabled interfaces [homer/public] - https://gerrit.wikimedia.org/r/592246
[07:54:31] (PS3) Ayounsi: Chassis: more generic, add ae count [homer/public] - https://gerrit.wikimedia.org/r/592251
[07:54:33] (PS3) Ayounsi: Add graceful-switchover to multiple RE devices [homer/public] - https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667)
[07:54:35] (PS3) Ayounsi: add graceful-restart to CRs [homer/public] - https://gerrit.wikimedia.org/r/577564 (https://phabricator.wikimedia.org/T191667) (owner: CDanis)
[07:56:58] (CR) Ayounsi: [C: +1] "Ready to merge once the plugin is merged." (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) (owner: Ayounsi)
[07:59:16] (CR) Muehlenhoff: [C: +1] "Looks good, one comment online!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/604413 (https://phabricator.wikimedia.org/T254982) (owner: Kormat)
[07:59:39] !log Restarted Zuul on contint2001 for config change # T253263
[07:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:43] T253263: Add a second Gerrit connection in Zuul config - https://phabricator.wikimedia.org/T253263
[07:59:48] !log upgrading remaining job runners in eqiad to PHP 7.2.31
[07:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:06] (CR) Dzahn: [WIP] webperf: Remove XHGui dependency on MongoDB (2 comments) [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[08:01:17] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[08:01:17] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[08:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:16] (PS1) Marostegui: mariadb: Allow reimage db2092 to Buster. [puppet] - https://gerrit.wikimedia.org/r/604598
[08:03:28] (PS1) Gergő Tisza: Help panel: Update guidance behavior rules [extensions/GrowthExperiments] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604599 (https://phabricator.wikimedia.org/T244431)
[08:03:55] (CR) Marostegui: [C: +2] mariadb: Allow reimage db2092 to Buster.
[puppet] - https://gerrit.wikimedia.org/r/604598 (owner: Marostegui)
[08:04:13] (CR) Dzahn: "to my surprise the compiler shows no difference on tungsten even though it clearly uses this role" [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[08:04:45] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[08:06:24] (CR) Dzahn: "ah, yea, it's because the role is replaced by another role. we'd have to compile it with tungsten actually being switched to the new role," [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[08:13:30] (CR) Dzahn: [C: +1] "with listen_address and allow-from being 127.0.0.1/8 it seems safe.
i looked at allow_from_listen but that would just add 127.0.0.1 anothe" [puppet] - https://gerrit.wikimedia.org/r/604452 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[08:15:23] (CR) Filippo Giunchedi: [C: +1] icinga: include notification type in host alert email subjects [puppet] - https://gerrit.wikimedia.org/r/604448 (owner: Herron)
[08:16:02] (PS3) Dzahn: phabricator: change sender address of community_metrics mail [puppet] - https://gerrit.wikimedia.org/r/603445
[08:16:07] (PS1) Muehlenhoff: Create repository component for memcached 1.6 [puppet] - https://gerrit.wikimedia.org/r/604603 (https://phabricator.wikimedia.org/T233933)
[08:16:19] (CR) Dzahn: [C: +2] phabricator: change sender address of community_metrics mail [puppet] - https://gerrit.wikimedia.org/r/603445 (owner: Dzahn)
[08:17:31] (CR) Muehlenhoff: [C: +2] Create repository component for memcached 1.6 [puppet] - https://gerrit.wikimedia.org/r/604603 (https://phabricator.wikimedia.org/T233933) (owner: Muehlenhoff)
[08:18:00] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[08:18:01] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:03] RECOVERY - Thanos swift https on thanos-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/Thanos
[08:22:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:01] (PS1) Filippo Giunchedi: conftool-data: add thanos-fe eqiad [puppet] -
https://gerrit.wikimedia.org/r/604608 (https://phabricator.wikimedia.org/T233956)
[08:23:36] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[08:23:36] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[08:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:06] RECOVERY - Thanos swift https on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.011 second response time https://wikitech.wikimedia.org/wiki/Thanos
[08:24:38] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:24:38] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[08:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:43] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:24:43] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:06] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:25:06] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:34] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:25:34] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:38] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:43] (PS2) Filippo Giunchedi: conftool-data: add thanos-fe eqiad [puppet] - https://gerrit.wikimedia.org/r/604608 (https://phabricator.wikimedia.org/T233956)
[08:27:55] (CR) Filippo Giunchedi: [C: +2] conftool-data: add thanos-fe eqiad [puppet] - https://gerrit.wikimedia.org/r/604608 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi)
[08:28:03] (CR) Alexandros Kosiaris: [C: +2] Renumber 3 kubernetes etcd nodes [dns] - https://gerrit.wikimedia.org/r/604512 (owner: Alexandros Kosiaris)
[08:30:12] (PS1) Marostegui: db2092: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/604612
[08:30:40] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[08:30:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:30] (CR) Alexandros Kosiaris: [C: +2] Cleanup some old leftovers [dns] - https://gerrit.wikimedia.org/r/604513 (owner: Alexandros Kosiaris)
[08:31:34] (CR) Marostegui: [C: +2] db2092: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/604612 (owner: Marostegui)
[08:33:16] (PS1) Filippo Giunchedi: hieradata: add eqiad for thanos-query / thanos-swift [puppet] - https://gerrit.wikimedia.org/r/604613 (https://phabricator.wikimedia.org/T233956)
[08:33:55] (CR) Dzahn: [C: +1] "I haven't tested this, as it's only needed for setting up new ganeti hosts and changes network/interfaces, but let's just merge it.
The "d" [puppet] - https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) (owner: Alexandros Kosiaris)
[08:37:39] Operations, MediaWiki-Vagrant, phan: It should be possible to install php-ast using apt-get on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (Lokal_Profil) @Mainframe98 Would you mind describing how you did get ast-php installed in your Vagrant? (Since I'm struggling with the same).
[08:38:28] (PS1) Marostegui: mariadb: Reimage es2024 with Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/604616 (https://phabricator.wikimedia.org/T250666)
[08:39:00] RECOVERY - Memcached on thanos-fe1001 is OK: TCP OK - 0.000 second response time on 10.64.0.136 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[08:39:07] (CR) Marostegui: [C: +2] mariadb: Reimage es2024 with Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/604616 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[08:39:28] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[08:39:58] !log Reimage es2024 to buster
[08:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:59] (CR) Dzahn: [C: +1] "from reading https://veerasundar.com/blog/2009/08/log4j-tutorial-additivity-what-and-why/ it seems to me this is still useful as it would " [puppet] - https://gerrit.wikimedia.org/r/508657 (owner: Paladox)
[08:41:28] Operations, MediaWiki-Vagrant, phan: It should be possible to install php-ast using apt-get on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (Mainframe98) >>!
In T234240#6213969, @Lokal_Profil wrote: > @Mainframe98 Would you mind describing how you did get ast-php installed in your Va...
[08:42:02] (CR) Dzahn: [C: -1] "> Patch Set 4: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/539211 (owner: Paladox)
[08:42:16] !log imported memcached 1.6.6-1~wmf10u1
[08:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:35] (PS5) Dzahn: Revert "Gerrit: Set base url for commitlink" [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[08:45:20] (PS1) Ema: ATS: use HTTP/1.1 and return in Lua healthchecks [puppet] - https://gerrit.wikimedia.org/r/604618 (https://phabricator.wikimedia.org/T255015)
[08:45:28] (CR) Dzahn: [C: +1] "as qchris said, we should do this after the version upgrade. we should definitely land it though. recently we had a user comment about not" [puppet] - https://gerrit.wikimedia.org/r/556270 (owner: Paladox)
[08:45:50] (CR) Ayounsi: "Thanks!" (15 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) (owner: Ayounsi)
[08:46:00] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 51 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:46:10] (PS5) Ayounsi: WMF specific Netbox plugin for interfaces config [software/homer/deploy] - https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429)
[08:46:39] (CR) Dzahn: "@paladox @qchris Let's confirm whether this is a requirement for the upgrade."
[puppet] - https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: Paladox)
[08:47:05] (CR) Vgutierrez: [C: +1] ATS: use HTTP/1.1 and return in Lua healthchecks [puppet] - https://gerrit.wikimedia.org/r/604618 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[08:48:00] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[08:48:00] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[08:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:29] (PS6) Ayounsi: WMF specific Netbox plugin for interfaces config [software/homer/deploy] - https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429)
[08:49:23] (CR) Ema: [C: +2] ATS: use HTTP/1.1 and return in Lua healthchecks [puppet] - https://gerrit.wikimedia.org/r/604618 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[08:49:38] (CR) Dzahn: [C: -1] "this will need https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/548549 to be merged first.. and both are not really needed an" [puppet] - https://gerrit.wikimedia.org/r/548552 (owner: Paladox)
[08:50:24] (CR) Dzahn: [C: -1] "per my last comment.. please feel free to re-add me once/if there is consensus for this."
[puppet] - https://gerrit.wikimedia.org/r/569627 (https://phabricator.wikimedia.org/T215360) (owner: Zoranzoki21)
[08:51:58] (CR) Dzahn: [C: -1] "@Paladox ping, let's talk about phab in devtools and if this is still needed or not" [puppet] - https://gerrit.wikimedia.org/r/565712 (owner: Paladox)
[08:53:49] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 41 probes of 658 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:55:06] (CR) Ayounsi: [C: +1] "Changes for 1 devices: ['asw2-ulsfo.mgmt.ulsfo.wmnet']" [homer/public] - https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) (owner: Ayounsi)
[08:58:33] (PS1) Dzahn: wmflib: add type for a valid PHP version [puppet] - https://gerrit.wikimedia.org/r/604622
[08:58:48] (PS1) Filippo Giunchedi: swift: remove swift-container-sharder unit [puppet] - https://gerrit.wikimedia.org/r/604623 (https://phabricator.wikimedia.org/T252186)
[08:58:54] (CR) Dzahn: php: Create profile::php to handle fpm/mod_php integration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479575 (owner: Paladox)
[08:59:12] (CR) jerkins-bot: [V: -1] wmflib: add type for a valid PHP version [puppet] - https://gerrit.wikimedia.org/r/604622 (owner: Dzahn)
[09:00:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:32] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/23158/" [puppet] - https://gerrit.wikimedia.org/r/604623 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[09:00:52] (CR) Dzahn: [C: -1] "please re-add me once this is ready / has +1" [puppet] - https://gerrit.wikimedia.org/r/577656 (owner: C.
Scott Ananian)
[09:01:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 47 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:02:05] (PS2) Dzahn: wmflib: add type for a valid PHP version [puppet] - https://gerrit.wikimedia.org/r/604622
[09:02:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:09] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 11 probes of 658 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:08:09] (PS1) Muehlenhoff: Make memcached 1.6 an option for the memcached class and enable for the IDPs [puppet] - https://gerrit.wikimedia.org/r/604626 (https://phabricator.wikimedia.org/T233933)
[09:09:45] (PS3) Kormat: install_server: Better error reporting for reuse-parts [puppet] - https://gerrit.wikimedia.org/r/604413 (https://phabricator.wikimedia.org/T254982)
[09:10:12] (CR) Kormat: install_server: Better error reporting for reuse-parts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/604413 (https://phabricator.wikimedia.org/T254982) (owner: Kormat)
[09:12:58] (CR) Muehlenhoff: [C: +1] install_server: Better error reporting for reuse-parts [puppet] - https://gerrit.wikimedia.org/r/604413 (https://phabricator.wikimedia.org/T254982) (owner: Kormat)
[09:17:23] Operations, ORES, Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (akosiaris) Stalled→Resolved a:akosiaris Everything is fine after a week, resolving this.
[09:17:46] (CR) Kormat: [C: +2] install_server: Better error reporting for reuse-parts [puppet] - https://gerrit.wikimedia.org/r/604413 (https://phabricator.wikimedia.org/T254982) (owner: Kormat)
[09:18:02] (PS1) Alexandros Kosiaris: oresrdb: decommision [puppet] - https://gerrit.wikimedia.org/r/604629 (https://phabricator.wikimedia.org/T254238)
[09:18:51] Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (Kormat)
[09:18:53] (CR) Alexandros Kosiaris: [C: +2] oresrdb: decommision [puppet] - https://gerrit.wikimedia.org/r/604629 (https://phabricator.wikimedia.org/T254238) (owner: Alexandros Kosiaris)
[09:18:54] Operations, Patch-For-Review: reuse-parts.sh: provide feedback to user when something fails - https://phabricator.wikimedia.org/T254982 (Kormat) Open→Resolved Fixed by https://gerrit.wikimedia.org/r/604413
[09:19:24] (CR) Jbond: Add analytics-product system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[09:20:50] (PS3) Jbond: Add analytics-product system user [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[09:21:40] (CR) jerkins-bot: [V: -1] Add analytics-product system user [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[09:23:29] (PS3) Elukey: Switch backend for piwik.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/603366 (https://phabricator.wikimedia.org/T252740)
[09:25:29] (PS1) Kormat: install_server: Revert d-i-test to original partition scheme.
[puppet] - 10https://gerrit.wikimedia.org/r/604636 (https://phabricator.wikimedia.org/T252027) [09:25:34] (03PS1) 10Alexandros Kosiaris: oresrdb: Cleanup profiles, hieradata, cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/604637 [09:25:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10akosiaris) [09:26:07] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10akosiaris) [09:27:57] (03PS1) 10Marostegui: es2024: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/604639 [09:28:29] (03CR) 10Dzahn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [09:29:31] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) > Wikitech username: Andrew Kuznetsov Im still unable to find this as a wikitech account, you need to... [09:29:35] (03CR) 10Dzahn: [C: 04-1] "@Jbond do you see why this is "Could not find any files from role/icinga/sync_icinga_state.sh"? I kept staring at it but it seems right to" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [09:32:01] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23162/" [puppet] - 10https://gerrit.wikimedia.org/r/604626 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [09:32:22] (03CR) 10Marostegui: [C: 03+2] es2024: Enable notifications. 
[puppet] - 10https://gerrit.wikimedia.org/r/604639 (owner: 10Marostegui) [09:32:45] (03CR) 10Elukey: [C: 03+2] Switch backend for piwik.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/603366 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [09:32:56] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [09:33:12] (03PS1) 10Arturo Borrero Gonzalez: cloud: cleanup unused records for old DNS servers [dns] - 10https://gerrit.wikimedia.org/r/604640 (https://phabricator.wikimedia.org/T254496) [09:33:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but see the comments inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604613 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:34:16] (03CR) 10Muehlenhoff: "Why does this drop 7.0? 7.0 is the default PHP version in stretch." [puppet] - 10https://gerrit.wikimedia.org/r/604622 (owner: 10Dzahn) [09:35:14] (03PS3) 10Dzahn: wmflib: add type for a valid PHP version [puppet] - 10https://gerrit.wikimedia.org/r/604622 [09:35:44] (03PS1) 10Jbond: admin: add vhargyono to ldap only group [puppet] - 10https://gerrit.wikimedia.org/r/604641 (https://phabricator.wikimedia.org/T254961) [09:36:00] !log switch piwik.wikimedia.org from matomo1001 to matomo1002 (new buster node) [09:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:38] (03CR) 10Jbond: [C: 03+2] admin: add vhargyono to ldap only group [puppet] - 10https://gerrit.wikimedia.org/r/604641 (https://phabricator.wikimedia.org/T254961) (owner: 10Jbond) [09:37:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] oresrdb: Cleanup profiles, hieradata, cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/604637 (owner: 10Alexandros Kosiaris) [09:37:41] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Vasanthi Hargyono - https://phabricator.wikimedia.org/T254961 
(10jbond) 05Open→03Resolved thanks @Dzahn added now [09:38:16] (03PS1) 10Jon Harald Søby: Add import sources for gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604642 (https://phabricator.wikimedia.org/T255098) [09:40:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/604452 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [09:42:47] (03CR) 10Dzahn: [C: 03+2] ATS: add planet.wikimedia.org to also map to its backend [puppet] - 10https://gerrit.wikimedia.org/r/599323 (owner: 10Dzahn) [09:42:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/604626 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [09:44:49] (03PS2) 10Arturo Borrero Gonzalez: cloud: cleanup unused records for old DNS servers [dns] - 10https://gerrit.wikimedia.org/r/604640 (https://phabricator.wikimedia.org/T254496) [09:45:01] (03PS2) 10Dzahn: ATS: add planet.wikimedia.org to also map to its backend [puppet] - 10https://gerrit.wikimedia.org/r/599323 [09:45:41] (03CR) 10Dzahn: [C: 03+1] admin: add vhargyono to ldap only group [puppet] - 10https://gerrit.wikimedia.org/r/604641 (https://phabricator.wikimedia.org/T254961) (owner: 10Jbond) [09:45:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, the commit message is outdated, though." 
[puppet] - 10https://gerrit.wikimedia.org/r/604622 (owner: 10Dzahn) [09:46:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: cleanup unused records for old DNS servers [dns] - 10https://gerrit.wikimedia.org/r/604640 (https://phabricator.wikimedia.org/T254496) (owner: 10Arturo Borrero Gonzalez) [09:46:17] (03PS4) 10Dzahn: wmflib: add type for a valid PHP version [puppet] - 10https://gerrit.wikimedia.org/r/604622 [09:46:29] !log upgrading labweb* PHP 7.2.31 [09:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:47] (03CR) 10Dzahn: [C: 03+2] ATS: add planet.wikimedia.org to also map to its backend [puppet] - 10https://gerrit.wikimedia.org/r/599323 (owner: 10Dzahn) [09:47:30] (03CR) 10Dzahn: "this is to avoid the error message when people enter http://planet.wikimedia.org without a language prefix" [puppet] - 10https://gerrit.wikimedia.org/r/599323 (owner: 10Dzahn) [09:49:06] (03CR) 10Volans: "Thanks for the quick fixes, few minor nits inline and it's good to go for me." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [09:50:57] (03PS1) 10Alexandros Kosiaris: otrs1001: Renumber to row B [dns] - 10https://gerrit.wikimedia.org/r/604645 [09:52:02] (03CR) 10Volans: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [09:52:28] (03PS7) 10Jbond: cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) [09:52:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs1001: Renumber to row B [dns] - 10https://gerrit.wikimedia.org/r/604645 (owner: 10Alexandros Kosiaris) [09:53:07] (03CR) 10Jbond: "updated thanks :)" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:54:20] !log upgrading mwmaint* to PHP 7.2.31 [09:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:52] !log Upgrade es2025 [09:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:05] (03CR) 10Filippo Giunchedi: hieradata: add eqiad for thanos-query / thanos-swift (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604613 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:58:02] !log upgrading netmon* to PHP 7.2.31 [09:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:05] (03PS1) 10Dzahn: ATS: fix backend for piwik caused by a bad rebase [puppet] - 10https://gerrit.wikimedia.org/r/604646 [09:59:17] (03CR) 10jerkins-bot: [V: 04-1] ATS: fix backend for piwik caused by a bad rebase [puppet] - 10https://gerrit.wikimedia.org/r/604646 (owner: 10Dzahn) [10:00:11] (03PS2) 10Dzahn: ATS: fix backend for piwik caused by a bad rebase [puppet] - 10https://gerrit.wikimedia.org/r/604646 [10:00:20] elukey: i messed up the piwiki backend, fixing it. 
sorry, bad rebase [10:00:26] piwik [10:00:42] :( np [10:01:18] actually i dont know what happens if there are 2 lines like this [10:01:27] maybe the first one just won [10:01:59] (03CR) 10Dzahn: [C: 03+2] ATS: fix backend for piwik caused by a bad rebase [puppet] - 10https://gerrit.wikimedia.org/r/604646 (owner: 10Dzahn) [10:02:27] it may try to load balance them in some way [10:02:38] true [10:02:40] are you going to run puppet on cp-text now? [10:02:53] if so ping #traffic [10:02:59] they were doing some maintenance [10:03:00] (03CR) 10Volans: [C: 03+1] "LGTM! We should add at this point some CI to this repo too." (034 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [10:03:19] elukey: oh, yea, let me ping them [10:03:50] 10Operations, 10Traffic: ATS memory leak upon removing healthchecks.so from configuration - https://phabricator.wikimedia.org/T255120 (10ema) [10:04:41] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:08:01] elukey: puppet running now (after talk on -traffic) [10:08:54] super thanks [10:11:30] elukey: the result of it was that the actual remap.config written by puppet removed the 1002 line and added the 1001 line.
so it was effectively just back to 1001 and not balancing, afaict [10:11:42] ahh okok [10:14:30] !log Applying temporary changes on mwdebug1001 [10:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:57] 10Operations, 10Traffic: ATS memory leak upon removing healthchecks.so from configuration - https://phabricator.wikimedia.org/T255120 (10ema) [10:16:27] (03CR) 10Mvolz: eventgate, eventstreams, citoid: Log with namedlevels (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris) [10:18:22] (03CR) 10Jbond: [C: 03+2] cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:21:07] !log Run scap pull at mwdebug1001 to revert temporary changes [10:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:39] !log restarting gerrit on gerrit-replica (gerrit2001) - java.lang.OutOfMemoryError: Java heap space [10:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:16] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) Thanks a lot! We don't have a lot of time during these days, would it be ok to schedule something early next Q?
(In July I mean) [10:22:43] 10Operations, 10observability: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10fgiunchedi) [10:23:40] (03PS1) 10Filippo Giunchedi: prometheus: enable httpd mod_rewrite [puppet] - 10https://gerrit.wikimedia.org/r/604648 (https://phabricator.wikimedia.org/T255124) [10:23:41] 10Operations, 10Traffic: ATS memory leak upon removing healthchecks.so from configuration - https://phabricator.wikimedia.org/T255120 (10ema) p:05Triage→03Medium [10:24:40] (03PS1) 10Arturo Borrero Gonzalez: toolforge: relocate nginx-ingress config from kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/604649 (https://phabricator.wikimedia.org/T195217) [10:25:26] 10Operations, 10observability, 10Patch-For-Review: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10fgiunchedi) Also `httpd` module should complain loudly / make puppet fail if it has just deployed an invalid configuration (i.e. `apache2ctl configt... 
[10:29:03] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add eqiad for thanos-query / thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/604613 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:30:06] (03CR) 10Ema: [C: 04-1] "This has to wait a bit, it looks like icinga also is using "Host: varnishcheck":" [puppet] - 10https://gerrit.wikimedia.org/r/604364 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [10:32:24] !log roll-restart pybal in eqiad lvs low-traffic [10:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:56] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 67 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal [10:36:11] expected ^ [10:36:30] (03PS1) 10Ema: varnish: use Host:varnishcheck.wm.org for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/604653 (https://phabricator.wikimedia.org/T255015) [10:36:45] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Discussed over IRC, good to merged and will follow up later on with the possible improvements." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [10:37:17] !log installing buster kernel security updates (no reboots yet, on hold for regression-free microcode update) [10:37:18] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.54:443, 10.2.2.53:80]) https://wikitech.wikimedia.org/wiki/PyBal [10:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:48] (03PS1) 10Matthias Mullie: $aliases should be an array of strings, not AliasGroup objects [extensions/MachineVision] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604654 [10:37:55] (03CR) 10Matthias Mullie: [C: 03+2] $aliases should be an array of strings, not AliasGroup objects [extensions/MachineVision] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604654 (owner: 10Matthias Mullie) [10:37:58] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.54:443, 10.2.2.53:80]) https://wikitech.wikimedia.org/wiki/PyBal [10:39:12] (03PS2) 10Ema: varnish: narrow down healthchecks definition [puppet] - 10https://gerrit.wikimedia.org/r/604364 (https://phabricator.wikimedia.org/T255015) [10:39:45] (03PS1) 10Mvolz: Update citoid to include change Ia5bc189 [deployment-charts] - 10https://gerrit.wikimedia.org/r/604655 [10:40:06] (03CR) 10Dzahn: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [10:40:18] (03PS3) 10Ema: varnish: narrow down healthchecks definition [puppet] - 10https://gerrit.wikimedia.org/r/604364 (https://phabricator.wikimedia.org/T255015) [10:40:51] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: cluster=thanos,service=thanos-query [10:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:09] (03PS3) 10Dzahn: 
icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) [10:41:10] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thanos-query on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-query is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:41:42] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: cluster=thanos,service=thanos-swift [10:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:14] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thanos-query on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-query is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:42:16] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thanos-swift on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:42:44] also known ^ should recover shortly [10:42:54] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:43:38] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:43:53] 10Operations, 10Traffic, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [10:45:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:46:18] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 69 
connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal [10:47:08] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.2.3 [software/homer] - 10https://gerrit.wikimedia.org/r/604656 [10:47:30] 10Operations, 10Traffic, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [10:47:40] !log repooling mw1318,mw2139,mw2145,mw2147,mw2221,mw2219,mw2250,mw2350 (these were depooled, but seem all fine in Icinga and were probably just forgotten) [10:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:09] (03PS1) 10KartikMistry: Update cxserver to 2020-06-10-044445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/604657 (https://phabricator.wikimedia.org/T254959) [10:48:22] oooof I think the mw errors might have been the pybal restart and bgp flap [10:49:17] it has reconverged and lvs1015 has taken over again from the looks of it [10:49:56] (03CR) 10Hnowlan: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:53:27] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10faidon) Approved. [10:53:27] (03CR) 10Ayounsi: [C: 03+1] CHANGELOG: add changelogs for release v0.2.3 [software/homer] - 10https://gerrit.wikimedia.org/r/604656 (owner: 10Volans) [10:54:52] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.2.3 [software/homer] - 10https://gerrit.wikimedia.org/r/604656 (owner: 10Volans) [10:54:54] (03Merged) 10jenkins-bot: $aliases should be an array of strings, not AliasGroup objects [extensions/MachineVision] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604654 (owner: 10Matthias Mullie) [10:55:33] matthiasmullie: can I also +2 my backport change now? [10:56:22] sure [10:56:28] Thanks. 
[10:57:34] (03PS7) 10Alexandros Kosiaris: ganeti: Add a ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) [10:58:06] (03CR) 10KartikMistry: [C: 03+2] "Backport to wmf.36" [extensions/ContentTranslation] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604587 (https://phabricator.wikimedia.org/T254965) (owner: 10KartikMistry) [10:58:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Sure! Merging with a couple of minor doc changes. It would be great if we could ditch it instead and manage to automate all these steps." [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) (owner: 10Alexandros Kosiaris) [10:58:40] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.2.3 [software/homer] - 10https://gerrit.wikimedia.org/r/604656 (owner: 10Volans) [10:59:52] (03CR) 10Volans: "Wouldn't a cookbook approach work in this case?" [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) (owner: 10Alexandros Kosiaris) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1100). [11:00:04] matthiasmullie, kart_, and kostajh: A patch you scheduled for European Mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:23] * kart_ is here and +2ed backport change. [11:00:26] (03PS1) 10Filippo Giunchedi: templates: add PTR for thanos-swift / thanos-query [dns] - 10https://gerrit.wikimedia.org/r/604664 (https://phabricator.wikimedia.org/T252186) [11:00:52] matthiasmullie: I think you are ready to deploy. 
[11:01:19] starting now [11:01:22] \o [11:04:06] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/MachineVision: $aliases should be an array of strings, not AliasGroup objects (duration: 01m 07s) [11:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:17] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition=5 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging- [11:04:17] ll&var-consumer_group=All [11:04:32] done [11:04:49] kart_ & kostajh are you deploying yourselves, or would you rather I do so? [11:05:03] matthiasmullie: I need someone (you?) to deploy for me please [11:05:03] matthiasmullie: I'm deploying my patch. [11:05:15] kostajh: sure [11:05:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Wouldn't a cookbook approach work in this case?" [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) (owner: 10Alexandros Kosiaris) [11:05:33] kart_: go ahead & LMK when you're done! [11:05:50] matthiasmullie: sure. CI will take few more minutes to finish.. [11:05:59] (03CR) 10Matthias Mullie: [C: 03+2] Help panel: Update guidance behavior rules [extensions/GrowthExperiments] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604599 (https://phabricator.wikimedia.org/T244431) (owner: 10Gergő Tisza) [11:06:11] (03CR) 10Volans: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) (owner: 10Alexandros Kosiaris) [11:06:42] kostajh: do you want to test this one on mwdebug? 
[11:06:58] matthiasmullie: yes please [11:08:31] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thanos-swift on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:43] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [11:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:54] Hey people. https://meta.wikimedia.org/wiki/Special:RecentChanges gives "Internal error", repeatedly [11:10:04] hauskatze: T253098 [11:10:04] T253098: "PHP Warning: A non-numeric value encountered" from NamespaceInfo.php via SpecialRecentChanges.php - NamespaceInfo::isTalk called with non-integer (string) namespace '-1' - https://phabricator.wikimedia.org/T253098 [11:10:10] hauskatze: hola, from an earlier comment, it looks like it is known https://phabricator.wikimedia.org/T255088 [11:10:29] hola marostegui / hi kostajh - taking a look at that task [11:10:57] (03PS7) 10Volans: WMF specific Netbox plugin for interfaces config [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [11:11:30] marostegui: that one was closed as a dupe of T253098 [11:11:42] I see it's UBN but affects meta, as well as ltwiki [11:13:56] hauskatze: try https://meta.wikimedia.org/wiki/Special:RecentChanges?hidebots=1&translations=filter&hidecategorization=1&limit=50&days=7&enhanced=1&urlversion=2 rather than whatever your saved preference is trying for [11:18:55] (03Merged) 10jenkins-bot: IssueTrackingTool: Fix js error in getCurrentNodeId method [extensions/ContentTranslation] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604587 (https://phabricator.wikimedia.org/T254965) (owner: 10KartikMistry) [11:18:57] (03Merged) 10jenkins-bot: Help panel: Update guidance behavior rules [extensions/GrowthExperiments] 
(wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604599 (https://phabricator.wikimedia.org/T244431) (owner: 10Gergő Tisza) [11:19:11] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: rename hiera key for ingress nodes [puppet] - 10https://gerrit.wikimedia.org/r/604665 (https://phabricator.wikimedia.org/T195217) [11:19:26] OK. deploying now. [11:20:49] (03PS2) 10Arturo Borrero Gonzalez: kubeadm: rename hiera key for ingress nodes [puppet] - 10https://gerrit.wikimedia.org/r/604665 (https://phabricator.wikimedia.org/T250172) [11:23:31] (03CR) 10Volans: [V: 03+2 C: 03+2] "Just rebased and resolved the conflict." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [11:23:57] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [11:26:31] (03PS1) 10Ayounsi: Junos json output is curious [cookbooks] - 10https://gerrit.wikimedia.org/r/604668 [11:28:15] matthiasmullie: When I did git fetch, it also pulled your +2'ed changes. I hope that's fine. [11:28:29] yeah that's ok [11:28:32] I've single file to deploy only. [11:28:34] Cool. [11:28:52] !log kartik@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/ContentTranslation/modules/tools/mw.cx.tools.IssueTrackingTool.js: Backport: [[gerrit|604587|IssueTrackingTool: Fix js error in getCurrentNodeId method (T254965)]] (duration: 01m 07s) [11:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:56] T254965: ContentTranslation: Uncaught TypeError: Cannot read property 'contains' of undefined - https://phabricator.wikimedia.org/T254965 [11:29:02] (03CR) 10Volans: [C: 03+1] "Looks good given the example json you gave me." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/604668 (owner: 10Ayounsi) [11:29:05] matthiasmullie: I'm done. [11:29:18] thanks! [11:29:25] kostajh: still around? [11:29:29] matthiasmullie: yep [11:30:00] ok, should be on mwdebug in a minute [11:30:59] k [11:31:05] kostajh: should be on mwdebug1001 - let me know when you're done testing! [11:31:14] matthiasmullie: thanks! looking [11:31:32] (03CR) 10Ayounsi: [C: 03+2] Junos json output is curious [cookbooks] - 10https://gerrit.wikimedia.org/r/604668 (owner: 10Ayounsi) [11:32:52] matthiasmullie: looks good! [11:34:16] (03PS1) 10Volans: Upstream release v0.2.3 including the new plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/604669 [11:34:20] ok, proceeding [11:34:23] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [11:34:23] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [11:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:38] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/GrowthExperiments: Help panel: Update guidance behavior rules (duration: 01m 06s) [11:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:50] kostajh: done! 
[11:36:14] !log EU BACON done [11:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:43] (03CR) 10Volans: [V: 03+2 C: 03+2] "As agreed on IRC" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/604669 (owner: 10Volans) [11:36:43] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [11:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:53] that's another test ^ [11:38:08] (03CR) 10Muehlenhoff: [C: 03+2] Make memcached 1.6 an option for the memcached class and enable for the IDPs [puppet] - 10https://gerrit.wikimedia.org/r/604626 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [11:38:27] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10Aklapper) @AndrewKuznetsov: And if [links on] https://phabricator.wikimedia.org/project/profile/956/ are uncl... [11:38:43] matthiasmullie: thank you! 
[11:39:59] (03PS1) 10MarcoAurelio: [enwikivoyage] Undeploy the Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604670 [11:41:02] (03PS2) 10MarcoAurelio: [enwikivoyage] Undeploy the Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604670 (https://phabricator.wikimedia.org/T254820) [11:41:51] (03PS7) 10Ayounsi: Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 [11:42:24] (03CR) 10jerkins-bot: [V: 04-1] Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 (owner: 10Ayounsi) [11:43:35] (03PS1) 10Jforrester: NamespaceInfo::makeValidNamespace: Don't throw for -1 or -2 [core] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604675 (https://phabricator.wikimedia.org/T253098) [11:43:46] (03CR) 10Jforrester: [C: 03+2] NamespaceInfo::makeValidNamespace: Don't throw for -1 or -2 [core] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604675 (https://phabricator.wikimedia.org/T253098) (owner: 10Jforrester) [11:44:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: rename hiera key for ingress nodes [puppet] - 10https://gerrit.wikimedia.org/r/604665 (https://phabricator.wikimedia.org/T250172) (owner: 10Arturo Borrero Gonzalez) [11:44:01] !log volans@deploy1001 Started deploy [homer/deploy@df83901]: Release v0.2.3 [11:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:26] !log volans@deploy1001 Finished deploy [homer/deploy@df83901]: Release v0.2.3 (duration: 00m 25s) [11:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:31] (03PS1) 10Muehlenhoff: Fix installation of memcached 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/604676 [11:44:43] (03CR) 10jerkins-bot: [V: 04-1] Fix installation of memcached 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/604676 (owner: 10Muehlenhoff) [11:46:23] !log Deploy schema change on s6 codfw - T250066 
[11:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:27] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [11:47:25] (03PS1) 10Jbond: cookbook sre.pdus: add reboot script [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) [11:49:16] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.pdus: add reboot script [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [11:49:18] (03PS8) 10Ayounsi: Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 [11:50:55] (03PS2) 10Muehlenhoff: Fix installation of memcached 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/604676 [11:53:10] (03CR) 10Muehlenhoff: [C: 03+2] Fix installation of memcached 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/604676 (owner: 10Muehlenhoff) [11:53:47] (03CR) 10Ayounsi: Netbox driven routers disabled interfaces (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/592246 (owner: 10Ayounsi) [11:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2075 T254139', diff saved to https://phabricator.wikimedia.org/P11469 and previous config saved to /var/cache/conftool/dbconfig/20200611-115430-marostegui.json [11:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:35] T254139: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 [11:55:55] (03CR) 10Jforrester: [C: 03+1] [enwikivoyage] Undeploy the Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604670 (https://phabricator.wikimedia.org/T254820) (owner: 10MarcoAurelio) [11:56:04] (03PS1) 10Awight: Migrate QuickSurveys `layout` param [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604681 (https://phabricator.wikimedia.org/T255130) [12:00:02] 
10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10jbond) [12:00:25] (03CR) 10Jbond: [C: 03+2] admin: add shell account for lmata and add to ops group [puppet] - 10https://gerrit.wikimedia.org/r/603950 (https://phabricator.wikimedia.org/T254818) (owner: 10Jbond) [12:01:35] (03Merged) 10jenkins-bot: NamespaceInfo::makeValidNamespace: Don't throw for -1 or -2 [core] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604675 (https://phabricator.wikimedia.org/T253098) (owner: 10Jforrester) [12:02:40] (03PS1) 10Marostegui: mariadb: Reimage es2023 [puppet] - 10https://gerrit.wikimedia.org/r/604682 (https://phabricator.wikimedia.org/T250666) [12:03:19] PROBLEM - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [12:03:50] !log Reimage es2023 (es5 codfw master) [12:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:20] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.36/includes/title/NamespaceInfo.php: T253098 NamespaceInfo::makeValidNamespace: Don't throw for -1 or -2 (duration: 01m 06s) [12:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:24] T253098: "PHP Warning: A non-numeric value encountered" from NamespaceInfo.php via SpecialRecentChanges.php - NamespaceInfo::isTalk called with non-integer (string) namespace '-1' - https://phabricator.wikimedia.org/T253098 [12:04:25] (03CR) 10Kormat: [C: 03+1] mariadb: Reimage es2023 [puppet] - 10https://gerrit.wikimedia.org/r/604682 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [12:04:31] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage es2023 [puppet] - 10https://gerrit.wikimedia.org/r/604682 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [12:04:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: 
Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10jbond) 05Open→03Resolved a:03jbond @lmata this has been merged now, access should be enabled as puppet runs max ~30mins. Please reopen iof there are any issues [12:08:27] (03CR) 10Jbond: [C: 03+1] "Thanks <3 LGTM optional nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [12:10:55] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) @Nuria are you able to approve the access to researchers statistics-privatedata-users analytics... [12:11:08] (03PS2) 10Kormat: install_server: Revert d-i-test to original partition scheme. [puppet] - 10https://gerrit.wikimedia.org/r/604636 (https://phabricator.wikimedia.org/T252027) [12:11:53] (03CR) 10Marostegui: [C: 03+1] install_server: Revert d-i-test to original partition scheme. [puppet] - 10https://gerrit.wikimedia.org/r/604636 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [12:12:25] (03CR) 10Kormat: [C: 03+2] install_server: Revert d-i-test to original partition scheme. 
[puppet] - 10https://gerrit.wikimedia.org/r/604636 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [12:13:36] (03PS12) 10Ayounsi: Netbox driven switch interfaces configuration [homer/public] - 10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) [12:13:38] (03PS4) 10Ayounsi: Netbox driven routers disabled interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/592246 [12:13:40] (03PS4) 10Ayounsi: Chassis: more generic, add ae count [homer/public] - 10https://gerrit.wikimedia.org/r/592251 [12:13:42] (03PS4) 10Ayounsi: Add graceful-switchover to multiple RE devices [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) [12:13:44] (03PS4) 10Ayounsi: add graceful-restart to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (https://phabricator.wikimedia.org/T191667) (owner: 10CDanis) [12:15:34] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [12:15:34] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [12:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:28] (03PS2) 10Jbond: cookbook sre.pdus: add reboot script [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) [12:22:48] 10Operations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10MoritzMuehlenhoff) [12:23:37] ACKNOWLEDGEMENT - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused John Bond service is being installed https://wikitech.wikimedia.org/wiki/Memcached [12:25:28] jbond42 is a service now? 
:) [12:25:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [12:25:38] hehe [12:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:15] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition=5 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging- [12:26:15] ll&var-consumer_group=All [12:27:03] (03PS1) 10Elukey: Remove matomo1002 specific overrides [puppet] - 10https://gerrit.wikimedia.org/r/604690 [12:27:44] (03CR) 10Elukey: [C: 03+2] Remove matomo1002 specific overrides [puppet] - 10https://gerrit.wikimedia.org/r/604690 (owner: 10Elukey) [12:28:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:31] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [12:28:31] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [12:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:44] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [12:31:20] PROBLEM - Check systemd state on thanos-be1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:24] PROBLEM - Check systemd state on thanos-be1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:41] that's me ^ [12:31:42] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:20] RECOVERY - Check systemd state on thanos-be1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:26] RECOVERY - Check systemd state on thanos-be1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:27] (03PS1) 10Elukey: Assign role archiva to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/604691 (https://phabricator.wikimedia.org/T252767) [12:32:46] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:18] RECOVERY - Memcached on idp-test2001 is OK: TCP OK - 0.037 second response time on 208.80.153.25 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [12:35:04] RECOVERY - Confd template for /srv/config-master/pybal/codfw/thanos-query on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:35:40] RECOVERY - Confd 
template for /srv/config-master/pybal/codfw/thanos-swift on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:35:56] RECOVERY - Confd template for /srv/config-master/pybal/codfw/thanos-swift on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:36:00] 10Operations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10MoritzMuehlenhoff) The current setup can even lead to startup failures, on one of the IDP hosts I ran into the issue that the Puppet run for the systemd unit raced with the service restart triggered by the package u... [12:36:01] (03PS2) 10Jforrester: [Beta Cluster] Add visualeditor-realtime.wmflabs.org to CSP's approved domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604358 (owner: 10Esanders) [12:36:05] !log updated pcc facts [12:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:57] (03PS6) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 [12:37:32] (03CR) 10Kormat: Add native mysql spicerack module. 
(035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [12:39:12] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Add visualeditor-realtime.wmflabs.org to CSP's approved domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604358 (owner: 10Esanders) [12:39:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [12:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:35] (03Merged) 10jenkins-bot: [Beta Cluster] Add visualeditor-realtime.wmflabs.org to CSP's approved domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604358 (owner: 10Esanders) [12:42:03] 10Operations, 10DBA: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (10Kormat) 05Stalled→03Declined With {T252027} being resolved, there is no longer a need for this. [12:42:11] (03PS1) 10Marostegui: es2023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/604695 [12:43:15] (03CR) 10Marostegui: [C: 03+2] es2023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/604695 (owner: 10Marostegui) [12:45:58] RECOVERY - Confd template for /srv/config-master/pybal/codfw/thanos-query on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:47:14] (03CR) 10Jforrester: [C: 03+1] "For after the wmf.36 train." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301) (owner: 10Reedy) [12:48:47] (03CR) 10Vgutierrez: [C: 03+1] varnish: use Host:varnishcheck.wm.org for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/604653 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [12:53:56] 10Operations, 10Traffic, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [12:53:57] (03CR) 10Vgutierrez: [C: 03+1] varnish: narrow down healthchecks definition [puppet] - 10https://gerrit.wikimedia.org/r/604364 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [12:54:31] (03CR) 10Ema: [C: 03+2] varnish: use Host:varnishcheck.wm.org for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/604653 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [12:54:56] (03PS1) 10Jbond: labtestpuppet: puppet::servers [puppet] - 10https://gerrit.wikimedia.org/r/604696 [12:55:11] (03CR) 10Muehlenhoff: Add analytics-product system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [12:55:47] (03PS2) 10Jbond: labtestpuppet: puppet::servers [puppet] - 10https://gerrit.wikimedia.org/r/604696 [12:58:06] (03CR) 10Muehlenhoff: Add analytics-product system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [13:00:04] longma and liw: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1300). 
[13:00:54] (03PS3) 10Jbond: labtestpuppet: puppet::servers [puppet] - 10https://gerrit.wikimedia.org/r/604696 (https://phabricator.wikimedia.org/T254491) [13:03:15] 10Operations, 10RESTBase, 10serviceops, 10RESTBase-architecture, 10Service-Architecture: Use the service proxy in restbase - https://phabricator.wikimedia.org/T255133 (10Joe) [13:04:36] (03PS2) 10Elukey: Assign role archiva to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/604691 (https://phabricator.wikimedia.org/T252767) [13:04:38] (03PS1) 10Elukey: archiva::proxy: raise TLS ciphersuite requirements [puppet] - 10https://gerrit.wikimedia.org/r/604698 (https://phabricator.wikimedia.org/T252767) [13:04:48] 10Operations, 10RESTBase, 10serviceops, 10RESTBase-architecture, 10Service-Architecture: Use the service proxy in restbase - https://phabricator.wikimedia.org/T255133 (10Joe) we will have to add some more refinement to the service proxy - specifically we don't need to install all of the remote cluster ha... [13:05:07] (03CR) 10jerkins-bot: [V: 04-1] archiva::proxy: raise TLS ciphersuite requirements [puppet] - 10https://gerrit.wikimedia.org/r/604698 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [13:05:08] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10jbond) p:05Triage→03Medium [13:05:42] 10Operations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10jbond) p:05Triage→03Medium [13:07:40] (03PS2) 10Elukey: archiva::proxy: raise TLS ciphersuite requirements [puppet] - 10https://gerrit.wikimedia.org/r/604698 (https://phabricator.wikimedia.org/T252767) [13:07:42] (03PS3) 10Elukey: Assign role archiva to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/604691 (https://phabricator.wikimedia.org/T252767) [13:09:43] (03CR) 10Elukey: 
"https://puppet-compiler.wmflabs.org/compiler1001/23168/archiva1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/604698 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [13:09:49] 10Operations, 10observability, 10Patch-For-Review: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10jbond) >>! In T255124#6214522, @fgiunchedi wrote: > Also `httpd` module should complain loudly / make puppet fail if it has just deployed an invalid... [13:11:00] (03PS4) 10Elukey: Assign role archiva to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/604691 (https://phabricator.wikimedia.org/T252767) [13:11:46] (03CR) 10Elukey: [C: 03+2] Assign role archiva to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/604691 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [13:14:11] 10Operations, 10observability, 10Patch-For-Review: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10jbond) p:05Triage→03Medium [13:18:26] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:43] this is me, new host --^ [13:22:20] 10Operations, 10ORES, 10Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (10Halfak) Thank you, @akosiaris! 
[13:23:39] (03CR) 10Dzahn: icinga: convert sync_icinga_state.sh.erb to file with config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [13:23:56] (03PS4) 10Dzahn: icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) [13:25:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [13:29:18] (03CR) 10Ema: [C: 03+2] varnish: narrow down healthchecks definition [puppet] - 10https://gerrit.wikimedia.org/r/604364 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:29:54] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23169/icinga2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [13:30:30] PROBLEM - DPKG on archiva1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:35:09] !log filippo@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-swift,name=eqiad [13:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:16] !log filippo@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-query,name=eqiad [13:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] (03CR) 10Dzahn: "[icinga2001:~] $ cat /etc/icinga/active_host" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [13:39:26] (03CR) 10Filippo Giunchedi: [C: 03+2] templates: add PTR for thanos-swift / thanos-query [dns] - 10https://gerrit.wikimedia.org/r/604664 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:40:17] 10Operations, 10Traffic, 10Patch-For-Review: 
Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [13:41:43] (03CR) 10Dzahn: "Jun 11 13:41:03 icinga2001 puppet-agent[20603]: Disabling Puppet." [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [13:43:05] 10Operations, 10observability, 10Patch-For-Review: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10fgiunchedi) >>! In T255124#6215034, @jbond wrote: >>>! In T255124#6214522, @fgiunchedi wrote: >> Also `httpd` module should complain loudly / make p... [13:44:43] 10Operations, 10observability, 10Patch-For-Review: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10jbond) >>! In T255124#6215264, @fgiunchedi wrote: > > Indeed that's a bummer we can't really use validate_cmd without copying the whole config :( I... [13:45:39] 10Operations, 10Domains, 10Traffic: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10jbond) p:05Triage→03Medium [13:46:22] (03PS1) 10Dzahn: sync_icinga_state: remove superfluous $ [puppet] - 10https://gerrit.wikimedia.org/r/604709 [13:46:34] (03CR) 10jerkins-bot: [V: 04-1] sync_icinga_state: remove superfluous $ [puppet] - 10https://gerrit.wikimedia.org/r/604709 (owner: 10Dzahn) [13:46:53] (03PS2) 10Dzahn: sync_icinga_state: remove superfluous $ [puppet] - 10https://gerrit.wikimedia.org/r/604709 [13:47:10] (03CR) 10Dzahn: [C: 03+2] sync_icinga_state: remove superfluous $ [puppet] - 10https://gerrit.wikimedia.org/r/604709 (owner: 10Dzahn) [13:47:56] (03CR) 10Ssingh: [C: 03+2] wikidough: set up the pdns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/604452 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:50:54] (03PS1) 10Ema: ATS: use X-Cache-Status 'int' for responses without lookup [puppet] - 10https://gerrit.wikimedia.org/r/604710 
(https://phabricator.wikimedia.org/T255015) [13:56:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:56:33] Urbanecm: thanks! ^ [13:57:36] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) Technically this is done because both c... [13:59:33] (03PS1) 10Muehlenhoff: Remove unused/dead code [cookbooks] - 10https://gerrit.wikimedia.org/r/604713 [14:00:34] RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:37] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/604713 (owner: 10Muehlenhoff) [14:01:10] RECOVERY - DPKG on archiva1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:01:51] (03PS1) 10Volans: Add support for buster in the build process [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/604715 (https://phabricator.wikimedia.org/T245114) [14:02:11] (03CR) 10jerkins-bot: [V: 04-1] Remove unused/dead code [cookbooks] - 10https://gerrit.wikimedia.org/r/604713 (owner: 10Muehlenhoff) [14:03:41] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [14:03:42] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [14:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:46] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [14:03:46] !log jmm@cumin1001 END (FAIL) - 
Cookbook sre.hosts.reboot-single (exit_code=99) [14:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:01] (03PS1) 10Ema: purged: restart upon configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/604716 [14:04:47] 10Operations, 10observability, 10Patch-For-Review: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10fgiunchedi) >>! In T255124#6215266, @jbond wrote: >>>! In T255124#6215264, @fgiunchedi wrote: >> >> Indeed that's a bummer we can't really use vali... [14:06:17] jouncebot: now [14:06:18] For the next 0 hour(s) and 53 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1300) [14:07:20] <_joe_> I just got paged? [14:07:29] paged, nothing here? looking [14:07:31] same, it ws an ack though? [14:07:32] icinga2001 [14:07:53] oh, no I guess it wasn't [14:08:35] looks like the external monitoring for icinga triggered [14:08:58] yep [14:09:15] godog: we should try to merge that change for VO and the external monitoring [14:09:21] eh.. i just tested syncing the icing state to 2001 and saw some duplicate definitions. i did not get a text [14:09:42] (03PS2) 10Muehlenhoff: Remove unused/dead code [cookbooks] - 10https://gerrit.wikimedia.org/r/604713 [14:09:45] mutante: no SMS pages anymore, it's only victorops now [14:10:05] but you can setup VO to page you ;) [14:10:05] if you didn't get a page, VO may not be configured correctly for you [14:10:13] I mean to SMS you [14:10:17] in addition to push page [14:10:31] is there anything to do here? [14:10:43] did the external monitoring used to alert Icinga before or is that a regression induced by some VO change? [14:11:00] a regression [14:11:04] alert IRC I meant [14:11:05] I'm not sure, mutante was that temporary testing ? 
[14:11:20] the external monitoring was not paging for the passive icinga host, but we now send an email to VO that pages [14:11:27] ack [14:11:33] i ran the exact command that is normally running in cron, nothing else [14:11:46] ['Passive Host Checks Being Accept [14:11:46] ed? MISSING_KEY (expected YES)', "Last External Command Check: MISSING_KEY (expected failed to calculate expected value: time data 'MISSING_KEY' does not match format '%Y-%m [14:11:50] -%d %H:%M:%S')", 'Event Handlers Enabled? MISSING_KEY (expected No)', 'Notifications Enabled? MISSING_KEY (expected NO)', 'Service Checks Being Executed? MISSING_KEY (expect [14:11:54] ed YES)', 'Passive Service Checks Being Accepted? MISSING_KEY (expected YES)'] [14:12:04] but it recovered [14:12:13] mutante: yes, that command stops icinga [14:12:20] mutante: ack [14:12:23] if it took too much time it might have triggere it [14:12:32] Jun 11 14:10:06 wikitech-static check_icinga[15448]: 2020-06-11 14:10:06,680 [INFO] Check for host icinga2001.wikimedia.org: OK [14:12:43] yea? but this is running all the time [14:12:51] volans: which VO change ? [14:12:53] I don't see the recovery emails though [14:13:34] mmhh I got the recovery from check_icinga four minutes ago [14:13:38] godog: the fact that we use the email notification and not the other contact1, it's a regression in behaviour comppared to before [14:13:38] I do have a recovery email, 1408 UTC [14:14:36] I don't, perhaps because rzl you're the one who ack'ed the page? [14:14:51] no, should be separate, the recovery email wasn't via victorops [14:14:54] so the monitoring sent the alert at 14:06:41,927 to VO and the recovery at 14:08:18,764 [14:15:10] I don't have a recovery email or text [14:15:10] mmmm or maybe not? 
it does say to:rlazarus@wikimedia.org which is weird [14:15:12] the emails will arrive [14:15:18] volans: yep sounds good to me [14:15:24] it's the rackspace service that takes a while sometimes [14:15:29] might be in the slow queue [14:15:29] I got a recovery email from icinga but none from VO [14:15:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate wikidata prune jobs to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/599956 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [14:16:52] ema: I have it resolved on VO [14:17:51] recovery received at 2020-06-11T14:09:53Z [14:18:08] volans: in the app I do too, my complaint is that I got a VO email for the beginning of the incident but no VO email for the end of it [14:18:23] (03PS3) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [14:18:47] that might be related on how you configured it, not sure [14:18:51] email received [14:19:07] (03CR) 10Ppchelko: "1. Both of the partitiners don't really belong there - they partition according to cirrus cluster or DB shard which are the property of pr" [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:19:08] (check_icinga only) [14:19:09] i don't understand why this would not happen all the time. that cron is hourly [14:19:12] (03CR) 10Ppchelko: [C: 04-1] changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:19:16] and stops icinga and then starts it again [14:19:48] maybe most of the time it's not stopped long enough to trigger [14:20:11] must be.. [14:20:23] mutante: there is a trick [14:20:39] the cron runs at 33, the external monitoring checks every 2 minutes ;) [14:21:08] heh, ok. yea, that would explain it. 
i ran it manually at a random time [14:21:10] if you picked the wrong time to restart it, that aligned with the external check... it explain it [14:21:25] to be clear, the external monitoring does multipe retries with sleep [14:21:26] i just wanted to double-check it works normal after i made a change to it [14:21:38] fair enough [14:21:41] which is that it is not an .sh.erb anymore [14:22:28] you can disable the meta-monitoring crontab anytime for those tests [14:22:31] just one of many for T254480 [14:22:31] T254480: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 [14:22:31] see https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring [14:22:47] ack [14:23:32] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition=5 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging- [14:23:32] ll&var-consumer_group=All [14:24:09] is anyone available to merge https://gerrit.wikimedia.org/r/#/c/integration/config/+/604719/ - `Add GlobalWatchlist extension to CI`? [14:26:06] (03Abandoned) 10Paladox: phabricator: use new profile::php module [puppet] - 10https://gerrit.wikimedia.org/r/479580 (owner: 10Paladox) [14:26:11] (03Abandoned) 10Paladox: php: Create profile::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 (owner: 10Paladox) [14:26:30] 10Operations, 10Domains, 10Traffic: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10jbond) Currently wikiba.se is not managed by wikimedia foundation, we don't have control over the hosting environment or management of the DNS. 
wikibase.org is a Wikimedia foundation... [14:26:44] mmhh the kafka lag for logstash seems like a logstash instance being unhappy, looking [14:27:42] (03CR) 10Paladox: "@Dzahn this is definitely a requirement for the upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [14:27:52] 10Operations, 10Domains, 10Traffic: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10Dzahn) For the reason why wikiba.se is not under the control of the WMF you can read history on T99531. [14:28:38] log bounce logstash on logstash1009, apparent GC death spiral [14:30:44] (03CR) 10Dzahn: "let's have a topic branch that lists all the things actually needed for the upgrade, as opposed to other pending changes we want AFTER the" [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [14:30:45] !log bounce logstash on logstash1009, apparent GC death spiral [14:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:37] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/q/topic:%22gerrit-3-upgrade%22+(status:open%20OR%20status:merged)" [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [14:31:43] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [14:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:36] (03Abandoned) 10Paladox: Gerrit: Symlink lib/mysql-connector to gerrit deployment repo [puppet] - 10https://gerrit.wikimedia.org/r/548552 (owner: 10Paladox) [14:32:41] (03Abandoned) 10Paladox: Add mysql-connector-java [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/548549 (owner: 10Paladox) [14:33:03] ema: we currently have recovery notifications disabled in the global settings, but could probably turn that on and let folks set at an individual level their 
preference about it [14:34:02] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:06] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `oresrdb[2001-2002].codfw.wmnet` - oresrdb2001.codfw.wmnet (**PASS**) - Downtimed host on I... [14:34:26] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:35:37] herron: ah, I see! ty [14:36:42] sure thing, np [14:36:56] 10Operations, 10observability, 10Patch-For-Review: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10jbond) > The puppet failure will also result in a icinga failure, I don't think we'd need a specific check for this Well i guess it depends how its... 
[14:37:09] !log enabled VO incident resolution notification in global settings [14:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:17] (03PS1) 10Alexandros Kosiaris: oresrdb: Remove the last references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/604730 (https://phabricator.wikimedia.org/T254238) [14:38:53] (03CR) 10Herron: [C: 03+2] icinga: include notification type in host alert email subjects [puppet] - 10https://gerrit.wikimedia.org/r/604448 (owner: 10Herron) [14:39:53] (03PS1) 10Alexandros Kosiaris: oresrdb: Remove all DNS entries except mgmt [dns] - 10https://gerrit.wikimedia.org/r/604731 (https://phabricator.wikimedia.org/T254238) [14:40:17] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [14:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:04] (03CR) 10Dzahn: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [14:42:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `oresrdb[1001-1002].eqiad.wmnet` - oresrdb1001.eqiad.wmnet (**PASS**) - Downti... 
[14:42:30] (03PS2) 10Alexandros Kosiaris: oresrdb: Remove all DNS entries except mgmt [dns] - 10https://gerrit.wikimedia.org/r/604731 (https://phabricator.wikimedia.org/T254238) [14:42:57] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10akosiaris) [14:43:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10akosiaris) [14:43:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] oresrdb: Remove the last references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/604730 (https://phabricator.wikimedia.org/T254238) (owner: 10Alexandros Kosiaris) [14:43:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] oresrdb: Remove all DNS entries except mgmt [dns] - 10https://gerrit.wikimedia.org/r/604731 (https://phabricator.wikimedia.org/T254238) (owner: 10Alexandros Kosiaris) [14:45:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10akosiaris) >>! In T254238#6211187, @Cmjohnson wrote: > @akosiaris Have any the initial steps been completed with this decom task? They have all been done now. 
[14:46:35] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [14:48:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/604648 (https://phabricator.wikimedia.org/T255124) (owner: 10Filippo Giunchedi) [14:50:36] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable httpd mod_rewrite [puppet] - 10https://gerrit.wikimedia.org/r/604648 (https://phabricator.wikimedia.org/T255124) (owner: 10Filippo Giunchedi) [14:51:09] (03PS1) 10Elukey: Add archiva-new.wikimedia.org as CNAME to archiva1002 [dns] - 10https://gerrit.wikimedia.org/r/604734 (https://phabricator.wikimedia.org/T252767) [14:53:19] (03CR) 10Muehlenhoff: [C: 03+2] Remove unused/dead code [cookbooks] - 10https://gerrit.wikimedia.org/r/604713 (owner: 10Muehlenhoff) [14:53:26] (03CR) 10Volans: "replies inline" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [14:53:34] (03PS10) 10Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/549222 [14:54:31] (03CR) 10Ayounsi: [C: 03+2] Netbox driven switch interfaces configuration [homer/public] - 10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [14:56:23] !log bounced elasticsearch on logstash1012 [14:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:43] (03PS1) 10Ema: purged: make Kafka cluster name configurable [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) [14:57:03] (03CR) 10Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [14:57:33] (03CR) 10Andrew Bogott: [C: 03+1] "There's no risk of important data loss on this host, so we can go ahead and try this." 
[puppet] - 10https://gerrit.wikimedia.org/r/604696 (https://phabricator.wikimedia.org/T254491) (owner: 10Jbond) [14:58:38] (03CR) 10Ppchelko: "Maybe it would be easier to just require it? The 'nearest main production' feels a bit hacky tbh" [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [14:58:57] (03PS2) 10Ema: purged: make Kafka cluster name configurable [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) [14:59:07] (03PS1) 10Ayounsi: Homer: enable wmf-plugin [puppet] - 10https://gerrit.wikimedia.org/r/604746 (https://phabricator.wikimedia.org/T250429) [14:59:42] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) Reseat for the second time today the riser resolved the problem [15:01:41] (03CR) 10Volans: [C: 03+1] "LGTM, ship it! :)" [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [15:02:19] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [15:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:31] !log mforns@deploy1001 Started deploy [analytics/refinery@c969b56]: Regular analytics weekly train [analytics/refinery@c969b56afae1b2532e07f0ff699c2ce161360966] [15:02:31] (03PS8) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:31] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10fgiunchedi) [15:04:05] !log root@cumin1001 START - Cookbook sre.network.prepare-upgrade [15:04:05] !log root@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [15:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:11] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10fgiunchedi) [15:04:11] !log mforns@deploy1001 Finished deploy [analytics/refinery@c969b56]: Regular analytics weekly train [analytics/refinery@c969b56afae1b2532e07f0ff699c2ce161360966] (duration: 01m 39s) [15:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:42] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [15:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:47] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [15:05:26] (03CR) 10jerkins-bot: [V: 04-1] rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [15:06:07] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1002/23170/" [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [15:06:15] (03PS11) 10Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/549222 [15:06:39] (03PS2) 10Ayounsi: Homer: enable wmf-plugin [puppet] - 10https://gerrit.wikimedia.org/r/604746 (https://phabricator.wikimedia.org/T250429) [15:07:23] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/604746 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [15:07:35] (03CR) 10Ayounsi: [C: 03+2] Homer: enable wmf-plugin [puppet] - 10https://gerrit.wikimedia.org/r/604746 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [15:09:12] (03PS1) 10Muehlenhoff: Actually pass a host to Icinga status check [cookbooks] - 10https://gerrit.wikimedia.org/r/604751 [15:10:51] (03CR) 10Jdlrobson: [C: 03+1] "Seems like this would be a sensible 
default. What was the reason for making in mandatory?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604681 (https://phabricator.wikimedia.org/T255130) (owner: 10Awight) [15:11:20] (03CR) 10Volans: [C: 04-1] "I think there's a small error." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/604751 (owner: 10Muehlenhoff) [15:14:52] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [15:18:27] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10Papaul) [15:19:00] (03CR) 10Muehlenhoff: Actually pass a host to Icinga status check (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/604751 (owner: 10Muehlenhoff) [15:19:12] (03PS2) 10Muehlenhoff: Actually pass a host to Icinga status check [cookbooks] - 10https://gerrit.wikimedia.org/r/604751 [15:19:20] (03PS1) 10Ayounsi: Sort access vlans to workaround old Python limitation [homer/public] - 10https://gerrit.wikimedia.org/r/604755 [15:19:28] (03CR) 10jerkins-bot: [V: 04-1] Sort access vlans to workaround old Python limitation [homer/public] - 10https://gerrit.wikimedia.org/r/604755 (owner: 10Ayounsi) [15:19:48] (03CR) 10Volans: [C: 03+1] "LGTM, sorry for not have catch it earlier" [cookbooks] - 10https://gerrit.wikimedia.org/r/604751 (owner: 10Muehlenhoff) [15:20:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Yay, this now works (jenkins now fails on purpose correctly), I 'll merge that last couple of patches to fix that and rebase this on top o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [15:20:58] (03PS1) 10Arturo Borrero Gonzalez: 
paws: add welcome banner [puppet] - 10https://gerrit.wikimedia.org/r/604756 [15:22:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] paws: add welcome banner [puppet] - 10https://gerrit.wikimedia.org/r/604756 (owner: 10Arturo Borrero Gonzalez) [15:22:58] (03PS1) 10Ayounsi: Sort access vlans for consistency on Python 3.5 [homer/public] - 10https://gerrit.wikimedia.org/r/604759 [15:23:00] (03Abandoned) 10Ayounsi: Sort access vlans to workaround old Python limitation [homer/public] - 10https://gerrit.wikimedia.org/r/604755 (owner: 10Ayounsi) [15:23:43] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10Papaul) [15:23:57] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/604759 (owner: 10Ayounsi) [15:24:21] (03CR) 10Muehlenhoff: [C: 03+2] Actually pass a host to Icinga status check [cookbooks] - 10https://gerrit.wikimedia.org/r/604751 (owner: 10Muehlenhoff) [15:24:23] (03CR) 10Ayounsi: [C: 03+2] Sort access vlans for consistency on Python 3.5 [homer/public] - 10https://gerrit.wikimedia.org/r/604759 (owner: 10Ayounsi) [15:24:32] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Actually pass a host to Icinga status check [cookbooks] - 10https://gerrit.wikimedia.org/r/604751 (owner: 10Muehlenhoff) [15:25:46] !log installing buster kernel security updates (no reboots yet) [15:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:28] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [15:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:27] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:57] (03PS1) 10Jbond: httpd: add validate_cmd to apache configs [puppet] - 10https://gerrit.wikimedia.org/r/604764 (https://phabricator.wikimedia.org/T255124) [15:40:59] (03PS1) 10Jbond: 
httpd: test validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/604765 (https://phabricator.wikimedia.org/T255124) [15:42:03] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: fix typo in hiera key name [puppet] - 10https://gerrit.wikimedia.org/r/604766 [15:42:10] (03CR) 10jerkins-bot: [V: 04-1] httpd: add validate_cmd to apache configs [puppet] - 10https://gerrit.wikimedia.org/r/604764 (https://phabricator.wikimedia.org/T255124) (owner: 10Jbond) [15:42:22] (03CR) 10jerkins-bot: [V: 04-1] httpd: test validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/604765 (https://phabricator.wikimedia.org/T255124) (owner: 10Jbond) [15:44:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: fix typo in hiera key name [puppet] - 10https://gerrit.wikimedia.org/r/604766 (owner: 10Arturo Borrero Gonzalez) [15:44:33] (03PS1) 10Ladsgroup: meet: Add ferm rule to open port 5000 to the cloud proxy [puppet] - 10https://gerrit.wikimedia.org/r/604773 (https://phabricator.wikimedia.org/T251034) [15:45:46] (03CR) 10jerkins-bot: [V: 04-1] meet: Add ferm rule to open port 5000 to the cloud proxy [puppet] - 10https://gerrit.wikimedia.org/r/604773 (https://phabricator.wikimedia.org/T251034) (owner: 10Ladsgroup) [15:46:04] (03PS2) 10Jbond: httpd: test validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/604765 (https://phabricator.wikimedia.org/T255124) [15:47:06] mutante: https://gerrit.wikimedia.org/r/604773 :D [15:47:14] (03CR) 10jerkins-bot: [V: 04-1] httpd: test validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/604765 (https://phabricator.wikimedia.org/T255124) (owner: 10Jbond) [15:50:13] (03PS2) 10Ladsgroup: meet: Add ferm rule to open port 5000 to the cloud proxy [puppet] - 10https://gerrit.wikimedia.org/r/604773 (https://phabricator.wikimedia.org/T251034) [15:50:44] (03PS7) 10Krinkle: Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [15:51:19] (03CR) 
10Krinkle: [C: 03+1] "LGTM. Once PrivSet in beta (and production!) are set ahead of this change, then this is good to land." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [15:52:28] Amir1: cool, but aren't we now reusing the existing hiera key [15:53:12] (03CR) 10Krinkle: Use PDO for XHGui storage if configured (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [15:53:43] Amir1: so the lookup would just have to be 'cache_hosts' and it should work [15:53:49] (03PS8) 10Krinkle: profiler: Add PDO driver for XHGui and enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [15:53:51] oh okay [15:54:47] Array[Stdlib::IP::Address] $cache_hosts = lookup('cache_hosts', [15:54:53] i see that being used already [15:55:10] it should just work both in prod and cloud [15:56:20] (03PS1) 10Ejegg: Remove ContributionTracking extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604778 [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1600). [16:00:37] (03PS3) 10Ladsgroup: meet: Add ferm rule to open port 5000 to the cloud proxy [puppet] - 10https://gerrit.wikimedia.org/r/604773 (https://phabricator.wikimedia.org/T251034) [16:02:03] (03CR) 10Jforrester: "This will need splitting into deployable commits; happy to do that for you, as and when this is good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604778 (owner: 10Ejegg) [16:05:45] (03CR) 10Krinkle: [C: 03+1] "Confirmed in Beta, and I've added placeholders to production to avoid variable undefined errors."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [16:05:59] (03PS1) 10Arturo Borrero Gonzalez: wmcs: kubeadm: introduce different haproxy port frontend/backend [puppet] - 10https://gerrit.wikimedia.org/r/604783 (https://phabricator.wikimedia.org/T195217) [16:06:02] (03PS5) 10Ayounsi: Netbox driven routers disabled interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/592246 [16:06:04] (03PS5) 10Ayounsi: Chassis: more generic, add ae count [homer/public] - 10https://gerrit.wikimedia.org/r/592251 [16:06:06] (03PS5) 10Ayounsi: Add graceful-switchover to multiple RE devices [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) [16:06:08] (03PS5) 10Ayounsi: add graceful-restart to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (https://phabricator.wikimedia.org/T191667) (owner: 10CDanis) [16:09:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC just fine:" [puppet] - 10https://gerrit.wikimedia.org/r/604783 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [16:10:03] !log downtimed labstore1004 for upgrades T224582 [16:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:07] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 [16:11:02] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) Ah ok. 
Good to know :) [16:11:13] (03CR) 10Dzahn: meet: Add ferm rule to open port 5000 to the cloud proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604773 (https://phabricator.wikimedia.org/T251034) (owner: 10Ladsgroup) [16:12:51] (03PS4) 10Ladsgroup: meet: Add ferm rule to open port 5000 to the cloud proxy [puppet] - 10https://gerrit.wikimedia.org/r/604773 (https://phabricator.wikimedia.org/T251034) [16:12:52] !log downtimed labstore1005 for upgrades on 1004 since that will alert as well T224582 [16:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:07] (03CR) 10Dzahn: [C: 03+2] wmflib: add type for a valid PHP version [puppet] - 10https://gerrit.wikimedia.org/r/604622 (owner: 10Dzahn) [16:35:57] (03PS1) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [16:36:26] (03PS1) 10Ema: purged: fix Kafka brokers TCP port if TLS is disabled [puppet] - 10https://gerrit.wikimedia.org/r/604790 (https://phabricator.wikimedia.org/T254844) [16:36:36] !log rebooting labstore1004 for upgrades T224582 [16:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:41] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 [16:36:55] (03PS2) 10Ema: purged: fix Kafka brokers TCP port if TLS is disabled [puppet] - 10https://gerrit.wikimedia.org/r/604790 (https://phabricator.wikimedia.org/T254844) [16:37:04] (03CR) 10jerkins-bot: [V: 04-1] wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [16:39:10] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/604790 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [16:40:34] (03PS4) 10Ema: purged: make Kafka 
cluster name configurable [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) [16:40:52] (03CR) 10Ema: [C: 03+2] purged: fix Kafka brokers TCP port if TLS is disabled [puppet] - 10https://gerrit.wikimedia.org/r/604790 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [16:42:21] (03PS5) 10Ema: purged: make Kafka cluster name configurable [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) [16:44:47] (03PS1) 10Ssingh: dnsdist: update the queries per second limit [puppet] - 10https://gerrit.wikimedia.org/r/604795 (https://phabricator.wikimedia.org/T252132) [16:47:02] (03CR) 10Ssingh: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23174/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/604795 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:49:06] !log doing stretch upgrade for labstore1004 T224582 [16:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:10] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 [16:49:38] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/604797 [16:49:58] (03PS3) 10Hnowlan: changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) [16:50:53] (03CR) 10Hnowlan: "> 5. 
I'm not sure how logs/stats are working in beta, but before we've had this config:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:51:07] (03CR) 10Hnowlan: "Output for updated values here https://phabricator.wikimedia.org/P11472" [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:57:23] (03PS16) 10Herron: elasticsearch: manage java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) [17:00:04] halfak and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1700). [17:01:03] 10Operations, 10Core Platform Team, 10Traffic, 10Patch-For-Review: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (10ema) I have cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604743/ on deployment-puppetmaster04 and added `profile::cache::purge::kafka... [17:06:46] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Jdforrester-WMF) 05Open→03Declined Working as expected. 
[17:10:31] (03PS17) 10Herron: elasticsearch: manage java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) [17:12:26] !log reboot for stretch upgrade on labstore1004 T224582 [17:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:31] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 [17:12:52] (03PS18) 10Herron: java: manage elasticsearch and kafka java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) [17:17:17] (03CR) 10Herron: "updated PCC https://puppet-compiler.wmflabs.org/compiler1002/23178/" [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [17:17:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "We could also just go with hiera everywhere, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [17:19:33] !log mbsantos@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:55] 10Operations, 10Core Platform Team, 10Traffic, 10Patch-For-Review: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (10ema) >>! In T254844#6216088, @ema wrote: > However, by looking at the actual PURGE requests generated by purged, it seems that we're only sending both kafka... [17:22:43] !log mbsantos@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' . [17:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:29] !log mbsantos@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:25] !log mbsantos@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' . [17:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:43] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10AndrewKuznetsov) I don't think anything in particular was unclear, I was just following a different guide I w... [17:32:33] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10AndrewKuznetsov) [17:40:52] (03PS1) 10Elukey: Add kafka-jumbo100[7-9] to analytics-in4 and analytics-in6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/604810 (https://phabricator.wikimedia.org/T252675) [17:41:03] ottomata: ==^ [17:41:08] (03PS4) 10Ppchelko: Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) [17:43:09] (03CR) 10Ppchelko: Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [17:48:12] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10Nuria) @AndrewKuznetsov do you have any description as to what your internship project entitles and the time... [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport window(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1800). [18:00:04] Pchelolo: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:19] I'll do mine, it's a no-op in production [18:00:46] (03CR) 10Ppchelko: [C: 03+2] Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:01:32] (03Merged) 10jenkins-bot: Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:06:48] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Beta: Switch from HTCP purging to kafka purging gerrit:603530, IS-labs.php (duration: 01m 06s) [18:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:25] !log ppchelko@deploy1001 Synchronized wmf-config/reverse-proxy-staging.php: Beta: Switch from HTCP purging to kafka purging gerrit:603530, reverse-proxy-staging.php (duration: 01m 06s) [18:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:31] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) [18:15:07] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) Marking off the server since it is now running stretch (and the right kernel). Just finishing up work to get the cl... 
[18:15:57] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) `lang=shell-session [bstorm@labstore1004]:~ $ sudo /usr/sbin/drbd-overview 1:test/0 Connected Secondary/Primary... [18:18:46] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:20:53] jouncebot: next [18:20:54] In 0 hour(s) and 39 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1900) [18:30:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:41:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_eventgate_analytics_cluster_eqiad,swagger_check_eventgate_analytics_external_cluster_eqiad,swagger_check_eventgate_main_cluster_eqiad,swagger_check_mathoid_cluster_eqiad,swagger_check_sessionstore_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometh [18:42:08] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:42:19] PROBLEM - LVS sessionstore eqiad port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on 
sessionstore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:42:20] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient.) timed out before a response was received: / (root with wrong query param) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}{/provider} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v2/suggest/source [18:42:20] ggest a source title to use for translation) timed out before a response was received: /v1/list/pair/{from}/{to} (Get the tools between two language pairs) timed out before a response was received: /_info/name (retrieve service name) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [18:42:22] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:42:26] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:42:26] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:42:28] here [18:42:28] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL 
- CRITICAL - sessionstore_8081: Servers kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:42:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:38] PROBLEM - LVS wikifeeds eqiad port 8889/tcp - A node webservice supporting featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 on wikifeeds.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:42:38] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:42:42] PROBLEM - LVS echostore eqiad port 8082/tcp - Echo store- echostore.svc.eqiad.wmnet IPv4 on echostore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:42:45] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:42:54] PROBLEM - eventgate-main LVS eqiad on eventgate-main.svc.eqiad.wmnet is CRITICAL: / (root with no query params) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [18:42:56] PROBLEM - eventgate-logging-external LVS eqiad on 
eventgate-logging-external.svc.eqiad.wmnet is CRITICAL: / (root with no query params) timed out before a response was received: / (root with wrong query param) timed out before a response was received: /robots.txt (robots.txt check) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [18:43:02] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:43:14] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:18] PROBLEM - eventgate-analytics-external LVS eqiad on eventgate-analytics-external.svc.eqiad.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7efce6308518, Connection to eventgate-analytics-external.svc.eqiad.wmnet timed out. (connect timeout=15)): /?spec https://wi [18:43:18] org/wiki/Event_Platform/EventGate [18:43:48] PROBLEM - Host kubernetes1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:56] PROBLEM - Host kubernetes1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:44:08] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:44:09] hmm networking issue? 
[18:44:12] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:44:12] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:44:14] RECOVERY - Host kubernetes1003 is UP: PING WARNING - Packet loss = 33%, RTA = 47.03 ms [18:44:18] RECOVERY - LVS wikifeeds eqiad port 8889/tcp - A node webservice supporting featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 on wikifeeds.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 945 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:44:18] ummm [18:44:20] RECOVERY - Host kubernetes1005 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:44:20] RECOVERY - LVS echostore eqiad port 8082/tcp - Echo store- echostore.svc.eqiad.wmnet IPv4 on echostore.svc.eqiad.wmnet is OK: HTTP OK: Status line output matched 200 - 258 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:44:22] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:44:24] here [18:44:28] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:44:28] RECOVERY - eventgate-analytics-external LVS eqiad on eventgate-analytics-external.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [18:44:34] RECOVERY - Host kubernetes1001 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [18:44:34] RECOVERY - eventgate-main LVS eqiad on eventgate-main.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [18:44:38] RECOVERY - eventgate-logging-external LVS eqiad on 
eventgate-logging-external.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [18:44:44] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:44:54] seems unlikely to be a networking issue, kube1001/03/05 are on different racks and rows [18:45:04] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:26] looking [18:45:39] (weird, VO didn't fire for me) [18:46:10] RECOVERY - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 54.98 ms [18:47:00] <_joe_> uh what happened? [18:47:04] PROBLEM - MediaWiki edit session loss on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/edit-count?panelId=13&fullscreen&orgId=1 [18:47:04] I am not sure yet [18:47:08] hey [18:47:09] but I think sessionstore is still down [18:47:19] I'll be IC, starting a doc [18:47:24] I'm getting repeated session loss while trying to edit [18:47:24] thanks rzl I was about to [18:47:24] <_joe_> everything seems back? [18:47:25] * apergos peeks in [18:47:28] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:47:29] o/ [18:47:32] _joe_: https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 [18:47:49] o/ [18:48:05] <_joe_> can someone call akosiaris? I'm moving to a place where I can help [18:48:55] <_joe_> so, sessionstore is running on dedicated nodes [18:49:16] I can text him [18:49:18] RECOVERY - k8s API server requests latencies on argon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:49:19] https://phabricator.wikimedia.org/T255179 [18:49:27] T255179 - session failures [18:49:27] T255179: Session failures preventing edits, login, logout, etc - https://phabricator.wikimedia.org/T255179 [18:49:28] effie: thanks, lmk if you don't hear back and I can call [18:49:30] thanks DannyS712 [18:49:38] <_joe_> kubernetes1001 is down [18:49:52] all the kask pods are in crashloopbackoff [18:49:53] kubernetes1005 hit the oom killer [18:50:10] <_joe_> yes [18:50:17] <_joe_> we need to call urandom I guess [18:50:25] Killed process 26098 (kask) total-vm:1010956kB, anon-rss:297164kB, file-rss:0kB, shmem-rss:0kB [18:50:28] <_joe_> and raise the limits for kask memory immediately [18:50:38] <_joe_> cdanis: kask for sessionstore, right? [18:50:49] _joe_: yes [18:50:58] rzl: I can call as I am local [18:51:02] if needed. [18:51:03] kubernetes1001 just crashed again [18:51:06] apergos: ack thanks, coordinate with effie please [18:51:21] he is joining soonish [18:51:31] 👍 [18:51:38] effie: thanks [18:51:56] * akosiaris around [18:52:10] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:52:11] akosiaris: kask @ sessionstore is OOM-looping and killing the whole machine [18:52:19] kube1001,3,5 bouncing [18:52:19] yep, that ^^ [18:52:23] * addshore reads up [18:52:59] sessionstore isn't though in kubernetes1003, nor kubernetes1001 ? 
[18:53:15] * addshore saw edit rate on wikidata.org drop to basically 0 [18:53:21] it's on kubernetes1005,kubernetes1006 [18:53:22] Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Services, Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (Mholloway) [18:53:39] akosiaris: okay, well, all the kask pods are in https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 [18:53:41] <_joe_> akosiaris: yes but I think kube-proxy is going crazy [18:53:41] err [18:53:43] CrashLoopBackOff [18:54:16] PROBLEM - Host kubernetes1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:54:21] <_joe_> kubernetes1006 is basically unusable, I am logged in but I can't do much [18:54:31] same with 1005 [18:54:46] PROBLEM - Host kubernetes1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:54:58] <_joe_> akosiaris: let's try to reboot one of those machines? [18:55:05] Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Services, Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (MSantos) [18:55:06] RECOVERY - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:55:06] RECOVERY - Host kubernetes1001 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:55:14] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [18:55:16] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-err
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgI [18:55:16] e=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [18:55:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:55:20] RECOVERY - Host kubernetes1005 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [18:55:26] (03PS1) 10Papaul: DHCP: Add MAC address for rdb200[7-8] [puppet] - 10https://gerrit.wikimedia.org/r/604832 (https://phabricator.wikimedia.org/T251626) [18:55:45] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10MSantos) [18:56:10] <_joe_> Jun 11 18:55:55 kubernetes1006 kubelet[7637]: E0611 18:55:55.656486 7637 pod_workers.go:186] Error syncing pod 386d4ee6-620d-11ea-8892-aa0000fe6bdf ("kask-production-6bb494b8f7-qs8jx_sessionstore(386d4ee6-620 [18:56:24] what the what? [18:56:27] _joe_: they are all up [18:56:35] <_joe_> akosiaris: all what? [18:56:40] nodes* [18:56:57] let me increase sessionstore capacity to make sure we aren't going down under pressure [18:57:06] <_joe_> yeah let's try that [18:57:42] there's no quick way to switch back to redis at this point, is there? 
[18:57:48] just for fast mitigation [18:58:15] <_joe_> rzl: I don't think so, no, but urandom might know better [18:58:34] PROBLEM - Host kubernetes1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:58:38] PROBLEM - Host kubernetes1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:58:45] I saw a smll burt of edits made on wikdata at 18:56 by logged in users, but it went away again then [18:58:48] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:58:50] *small burst [18:58:54] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore [18:59:00] akosiaris@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [18:59:02] RECOVERY - Host kubernetes1006 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:59:04] RECOVERY - Host kubernetes1001 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:59:06] rzl: I think the migration is complete; double writes are no longer happening so it'd log everyone out [18:59:13] <_joe_> we can switch to codfw though [18:59:21] !log depool eqiad, switch to codfw [18:59:23] _joe_: already done [18:59:25] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [18:59:27] <_joe_> akosiaris: ok [18:59:48] Pchelolo: any change w/ sessionstore? [18:59:52] FYI, !log won't work, I'll do my best to collect SAL entries in the status doc [18:59:57] s/any/anything/ [19:00:01] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [19:00:04] longma and liw: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T1900). [19:00:06] akosiaris@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[19:00:11] <_joe_> ok I see traffic going to codfw [19:00:12] I see another wave of edits by logged in users occurring at 18:59 [19:00:14] longma: liw: please hold the train [19:00:14] <_joe_> for sessionstore [19:00:19] cdanis: got it [19:00:20] urandom: no changes were made for sessionstore [19:00:25] !log increase sessionstore capacity in codfw from 4 pods to 6 [19:00:28] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [19:00:37] can confirm codfw sessionstore is now seeing traffic [19:00:39] saved an edit [19:00:42] <_joe_> I am logged in FWIW [19:00:47] <_joe_> it should be vaguely slower [19:00:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:00:49] rzl: not able to help with the general problem (better hands already in play) but let me know if i can help [19:00:56] jbond42: ack, thanks [19:00:59] appears to be back to working [19:01:06] we got an IC already btw?
[19:01:10] <_joe_> yes, rzl [19:01:14] cool [19:01:19] yes, status doc is in the other channel topic [19:02:13] <_joe_> so as of now everything should be migrated and sessions should still work [19:02:18] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:02:21] RECOVERY - LVS sessionstore eqiad port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.eqiad.wmnet is OK: HTTP OK: Status line output matched 200 - 258 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:02:26] one pod in eqiad is running, the others are still in CrashLoopBackOff [19:02:28] <_joe_> because of the fact sessionstore is multi-dc [19:02:36] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:02:38] if anyone sees any continuing user-visible issues, speak up please [19:02:39] Wikidata edit rate is back at 800+, which is a good sign :) [19:02:43] yes, that saved us for now. Unfortunately my capacity increase did not work [19:02:57] <_joe_> akosiaris: still crashing? can we see messages? [19:03:01] <_joe_> like logs for the crashes [19:03:06] Still can't log in @rzl [19:03:12] codfw is fine for now btw [19:03:28] mdaniels5757: thanks, can you say more? are you getting an error message, what is it, etc [19:03:39] @mdaniels5757 @rzl I can log in [19:03:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:03:55] <_joe_> akosiaris: maybe tear down and up the deployment in eqiad? [19:04:14] these pods are the stateless service that talks between MW and cassandra? [19:04:17] Huh, that's progress: "No active login attempt is in progress for your session. 
[19:04:17] " It was a red csrf-related box that I can't recall. [19:04:21] Krinkle: yes [19:04:25] _joe_: will that preserve logs though? [19:04:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:04:31] I think it's important to understand what happened [19:04:33] cool, I didn't realise they ran on separate hardware. [19:04:42] I assumed cassandra/kask were coupled. [19:04:45] cool :) [19:04:53] Aand it's better now [19:04:53] Operations, fundraising-tech-ops, netops, WMF-NDA: Deploy pfw policy 1591901800 for T122104 - https://phabricator.wikimedia.org/T255185 (Dwisehaupt) [19:04:55] <_joe_> akosiaris: restarts happening in codfw now [19:05:00] <_joe_> so not everything is ok [19:05:01] rzl: we have multiple users reporting csrf token issues [19:05:04] FYI I see wikidata edit rate dropping off again, could be nothing, but could be something [19:05:08] cdanis: ack [19:05:13] <_joe_> addshore: definitely something [19:05:35] <_joe_> akosiaris: we need more memory to the pods [19:05:54] Warning FailedScheduling 21s (x10 over 5m40s) default-scheduler 0/6 nodes are available: 2 Insufficient cpu, 4 node(s) didn't match node selector.
[19:05:56] ok let me fix that [19:06:02] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m [19:06:04] cdanis: yeah CPU :-( [19:06:48] wikidata is once again back at full pace [19:07:01] <_joe_> addshore: we're not fully stable [19:07:05] if we can't stabilize on sessionstore, we should consider switching back to redis even if it logs everyone out [19:07:12] some reports that users log out and log back in and then are ok [19:07:12] I'm not saying do it right now, but let's keep it on the table [19:07:19] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [19:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:26] <_joe_> I concur with rzl [19:07:34] !log increase memory limits for sessionstore in eqiad to 400Mi from 300Mi [19:07:39] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [19:07:54] akosiaris: same lack of CPU in eqiad [19:07:54] reports of edit errors now [19:07:59] only 4 pods of 8 scheduled [19:08:07] <_joe_> yes [19:08:27] do we need to have the restriction to particular kubelet machines right now? [19:08:29] so CPU I can fix relatively easily though [19:08:33] seems like an easy thing to loosen [19:08:36] yes [19:08:40] I think you are right [19:08:43] on https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 what's the deployment that happened at 18:36, right before this outage started? [19:08:45] let's actually fallback to that [19:08:52] have we already talked about that? 
[19:09:01] * akosiaris has no idea about that deployment [19:09:01] rzl: I think that was the initial crash, not a deployment [19:09:20] <_joe_> rzl: that's not a deployment I agree with cdanis [19:09:35] <_joe_> it's pods restarting [19:09:35] okay, just a restart? the red annotation is what I'm looking at [19:09:44] okay, thanks [19:09:55] rzl: yes, it comes from this query in the 'annotations' section of the dashboard config: resets((sum(http_request_duration_seconds_count{app="kask", kubernetes_namespace="$service"}))[1m:]) > bool 0 [19:10:02] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [19:10:03] so it's just any kask process restart for any reason [19:10:07] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [19:10:08] cdanis: got it, thank you [19:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:25] I am boldly clarifying the annotation name [19:10:34] !log remove the podaffinity restrictions for sessionstore in eqiad [19:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:16] <_joe_> yes it seems the best option [19:11:28] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:11:50] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [19:12:07] <_joe_> akosiaris: did you increase both cpu and memory limits for sessionstore? [19:12:07] akosiaris: okay should we repool eqiad now? lots more pods running there [19:12:32] deployment successful, yes doing so now [19:12:42] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore [19:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:52] !log repool eqiad for sessionstore [19:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:03] ok let's see if it now holds up [19:13:46] <_joe_> so, it's interesting how one service (although the most called one) failing made kube-proxy suffer so much [19:13:47] we need to increase both the size and the number of those sessionstore dedicated VMs [19:13:51] <_joe_> yes [19:14:14] yes, that's my main question, how on earth did this affect machines that did not run those pods? [19:14:31] dns discovery ttl is 5 minutes right? 
[19:14:35] <_joe_> akosiaris: kube-proxy is the only rational explanation [19:14:38] <_joe_> cdanis: yes [19:14:48] Seeing quite a few "/JobQueueEventBus.php: Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable" in the logs as well [19:14:50] <_joe_> we could actively clean the caches [19:14:54] more than usual anyway [19:15:14] <_joe_> Krinkle: see above, eventgate runs on kubernetes and somehow sessionstore caused widespread issues [19:15:35] <_joe_> Krinkle: you see them right now, or during the peak of the incident? [19:15:54] 20-ish minutes ago [19:16:04] <_joe_> that fits my explanation, yes [19:16:07] it caused several doParserCacheSaveComplete attempts in during edits to be aborted/cancelled [19:16:19] https://grafana.wikimedia.org/d/000001590/sessionstore?panelId=47&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=sessionstore&from=1591900236776&to=1591900788867 [19:16:19] traffic returning to eqiad [19:16:22] initial reports after that last sync are that things look good [19:16:26] that's the initial reason ^ [19:16:38] somehow session requests to session store increased from 15k to 20k [19:17:02] <_joe_> so we were quite under capacity with just 4 pods [19:17:10] that probably pushed the pod pretty close to the memory limit and from there malloc() failures or whatever sent it spiralling down [19:17:14] <_joe_> and those vms can't host more [19:17:25] now as to why the rest of the nodes had issues.... [19:17:43] <_joe_> that's something we can probably look at spelunking logs [19:17:52] _joe_: yeah but that's fixable. We now can add both more VMs and increase those VMs in size [19:18:12] <_joe_> yes, that's not the worrisome part of tonight by any measure [19:18:20] the trigger can be fixed... 
it's the widespread kubernetes failures I am worried about [19:18:31] akosiaris: k8s API latencies for GET went from ~1ms to ~100ms [19:18:45] <_joe_> cdanis: that's because of the timeouts everywhere I think [19:18:50] <_joe_> the servers were unreachable [19:18:50] yeah, two broader k8s-related AIs -- one is capacity planning (plus maybe capacity alerting) for k8s services so we don't get taken by surprise like this [19:18:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:19:04] and the other is "why was other k8s stuff affected" [19:19:11] <_joe_> which also means we have nothing in grafana to hint what was exhausted [19:19:25] !log gilles@deploy1001 Started deploy [performance/asoranking@0a096c4]: T252424 [19:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:29] T252424: Autonomous Systems report stopped working - https://phabricator.wikimedia.org/T252424 [19:19:43] rzl: _joe_: akosiaris: should we un-hold the train? [19:19:47] <_joe_> rzl: why the failure spread to nodes not serving the application [19:19:58] cdanis: I'd like to wait a little longer and make sure we're stabilized [19:20:04] open to disagreement though [19:20:05] +1 [19:20:11] <_joe_> +1 :P [19:20:13] !log gilles@deploy1001 Finished deploy [performance/asoranking@0a096c4]: T252424 (duration: 00m 47s) [19:20:14] still not certain :-) [19:20:14] sure, 10 minutes? [19:20:21] makes sense [19:20:31] gilles: please hold off on deploying anything else for a moment, there's an incident ongoing [19:20:33] <_joe_> akosiaris: should we do the same in codfw btw?
[19:21:12] happy to hold for a bit until ya'll are comfortable with stability :) [19:21:13] rzl: ok, bear in mind that this is not really a production service, it's something that runs on a cron job once a month [19:21:32] <_joe_> sessions are all back to eqiad, fwiw [19:21:33] gilles: thanks, I didn't recognize it and didn't want to go researching right now :) [19:21:53] would appreciate if you can hold off anyway just to keep the number of moving parts low, but it won't be long [19:22:02] <_joe_> but we had a huge increase in requests compared to yesterday [19:22:09] greg-g: appreciate it, will update you [19:22:15] so do we think a spike in requests was the trigger of all of this? [19:22:17] I'm completely done [19:22:20] rzl: as IC, I'll let you ping longma when ya'll are ready :) [19:22:22] it's plausible if we were already close to capacity [19:22:22] <_joe_> cdanis: clearly [19:22:23] gilles: ack, thanks [19:22:25] greg-g: will do [19:22:30] _joe_: yeah, but let me undo the changes in deploy1001 and make those actual gerrit commits [19:22:37] what caused that spike I wonder [19:22:41] <_joe_> akosiaris: oh I was offering to do it [19:22:46] and then we had too many input rps to recover? [19:22:48] <_joe_> apergos: that will need to be determined [19:22:55] <_joe_> cdanis: I think so yes [19:22:59] cdanis: I think it's exactly that [19:23:17] but we need to investigate the rest of the mess though [19:23:22] RECOVERY - MediaWiki edit session loss on graphite1004 is OK: OK: Less than 30.00% above the threshold [10.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/edit-count?panelId=13&fullscreen&orgId=1 [19:23:24] there is a lot of mess here, yes [19:23:48] I'm going to go back through this channel log and get the status doc as complete as I can, and we can start pulling AIs together [19:23:56] two questions first: [19:24:04] (1) is there still any user-visible trouble? 
[19:24:12] apparently not [19:24:14] and (2) are there any cleanup actions that we need to take today? [19:24:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:24:19] <_joe_> now it seems like there were huge spikes of iowait on the other k8s nodes [19:24:25] <_joe_> rzl: 1) no [19:24:31] <_joe_> 2) what alex is doing [19:24:35] <_joe_> uh [19:24:38] we still have elevated exceptions/sec [19:24:39] <_joe_> new problems? [19:25:19] <_joe_> cdanis: are you looking at logstash/mwlog? [19:25:23] I am [19:25:26] <_joe_> I don't think it's coming from sessionstore [19:25:42] or at least I am trying to, logstash is struggling for me [19:26:12] <_joe_> yeah old timer trick: use mwlog [19:26:18] https://grafana.wikimedia.org/d/000000102/production-logging?panelId=13&fullscreen&orgId=1&refresh=5m [19:27:13] <_joe_> timeouts I'd say [19:27:30] <_joe_> all from parsoid [19:27:59] herron: that looks like the absolute number of logs has gone down quite a lot, which makes me think logstash ingestion problems [19:28:16] objectcache [19:28:25] that's interesting [19:28:41] https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/604844 [19:28:44] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m yeah of course [19:29:16] ouch regarding that logstash dashboard [19:29:26] we do this like once every two months [19:29:33] 5Mi events [19:29:51] (09:55:16 μμ) icinga-wm: PROBLEM - Too many messages in kafka logging-eqiad we saw this earlier so [19:30:11] <_joe_> akosiaris: taking a look [19:30:24] akosiaris: lgtm [19:30:28] <_joe_> +1 too [19:30:53] <_joe_> so it's interesting, all kube nodes were being killed by iowait [19:30:58] it's now 10m from when we said 10m -- thoughts on 
unblocking the train? [19:31:06] cdanis: lots of lag, but logstash is consuming the queue https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All&from=now-3h&to=now [19:31:28] 10Operations, 10serviceops, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10bd808) >>! In T237889#6053911, @Joe wrote: > Then I'd definitely go with the idea of installing wikitech on a subset of appservers, at least at first. @joe would you als... [19:31:36] um notyet [19:31:48] let logstash settle first? [19:31:49] <_joe_> rzl: please let's wait for akosiaris to complete the deployment of the git-version [19:31:49] rzl I did see a new train blocker task https://phabricator.wikimedia.org/T255179 but I think it is related to this ongoing incident? [19:31:52] (03PS1) 10Addshore: Fix entity id lookup for interwiki special page links [extensions/Wikibase] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604845 (https://phabricator.wikimedia.org/T255078) [19:32:04] _joe_: didn't realize that was still going, thanks [19:32:05] that's the current outage task [19:32:16] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' . 
[19:32:16] rzl: longma: we should 100% block the train until this graph returns to 0: https://grafana.wikimedia.org/d/000000561/logstash?panelId=21&fullscreen&orgId=1&refresh=5m [19:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:23] longma: yep, that's the issue we're working on -- keep holding the train please [19:32:27] will do [19:32:29] https://phabricator.wikimedia.org/T255179 [19:32:33] rzl: longma: scap canary checking relies on logstash to function at all, and it does not take into account logstash indexing/queuing latency [19:32:42] cdanis: oh, thank you [19:32:59] (I think I filed a task about this at some point) [19:33:06] !log apply emergency sessionstore fixes in codfw as well [19:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:09] (we shot ourselves in the foot this way once in the past) [19:33:37] 8 pods in codfw as well, at least on the trigger front I think we are ok now [19:33:43] nice [19:34:04] logspam-watch in mwlog1001 doesn't show anything unusual [19:34:17] (03CR) 10Awight: "> Seems like this would be a sensible default. What was the reason" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604681 (https://phabricator.wikimedia.org/T255130) (owner: 10Awight) [19:34:50] (03CR) 10BryanDavis: Session Store: Switch everything to kask-session (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [19:35:07] Should the incident task be re-opened? [19:35:18] RhinosF1: are you still seeing issues? [19:35:33] thinking out loud, when the queue size in kafka is too big, it should start load shedding (RED I assume) [19:35:37] cdanis: no [19:35:40] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) 05Open→03Resolved a:03Bstorm Done.
[19:35:41] (no reports of issues elsewhere) [19:35:42] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Bstorm) [19:35:45] Just noticed you still seem busy [19:35:48] 10Operations, 10DC-Ops, 10cloud-services-team (Hardware): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm) [19:36:08] <_joe_> akosiaris: I think we need to add the additional k8s servers ASAP [19:36:16] cdanis: someone might be, see https://phabricator.wikimedia.org/T255179#6216760 [19:36:36] _joe_: yup. That and another 2 dedicated sessionstore VMs as well [19:36:42] 2 per DC that is [19:36:52] <_joe_> akosiaris: https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=2815&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&var-instance=All&from=now-7d&to=now isn't nice [19:37:17] _joe_: that link doesn't work for me, "panel with id 2815 not found" [19:37:23] yup, same here [19:37:26] <_joe_> wut [19:37:26] which graph did you mean? [19:37:44] it's probably one of the per-machine panels [19:37:51] lovely grafana bug there [19:37:57] <_joe_> yes [19:38:00] <_joe_> damn grafana [19:38:08] (03PS1) 10Krinkle: Ensure an array is passed to ApiEchoMute::lookupIds() [extensions/Echo] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604848 (https://phabricator.wikimedia.org/T254699) [19:38:11] (03PS1) 10Papaul: Site: Add rd200[1-7] with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/604847 (https://phabricator.wikimedia.org/T251626) [19:38:37] collapsed panel? [19:38:41] <_joe_> look at 1 week of cpu for kubernetes 1001-1004 [19:38:59] yeah not super happy [19:39:00] wow, what on earth? 
[19:39:25] <_joe_> started on 5/28 [19:39:25] whatever it was got bad around 5/29 [19:39:38] <_joe_> I think that's when we moved changeprop or something [19:40:00] (03CR) 10Ppchelko: Session Store: Switch everything to kask-session (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [19:40:00] <_joe_> but rzl this seems like quite the AI: figure out what the hell is starving the kube nodes with iowait [19:40:06] (03CR) 10Papaul: [C: 03+2] Site: Add rd200[1-7] with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/604847 (https://phabricator.wikimedia.org/T251626) (owner: 10Papaul) [19:40:35] (03PS1) 10Krinkle: Check for block in GlobalBlocking::getUserBlockDetails [extensions/GlobalBlocking] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604849 (https://phabricator.wikimedia.org/T254955) [19:41:10] something that now is on kubernetes1001 but not the others [19:41:14] for a variety of reasons it is funny to me to imagine iowait ever occurring on k8s kubes [19:41:15] but it looks like it was moving around [19:41:31] <_joe_> akosiaris: indeed [19:41:32] cdanis: same here. One thing I did not expect biting us on k8s nodes [19:41:36] <_joe_> try to look at ps? [19:41:46] <_joe_> that should tell us which process is doing iowait [19:42:00] <_joe_> I am looking at sal and unsurprisingly nothing really fits [19:42:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:43:14] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 413.5 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [19:43:19] cdanis: longma: fyi, projecting that logstash graph out, I don't expect it to catch up fully for 3-4 hours [19:43:25] (03CR) 10Krinkle: Session Store: Switch everything to kask-session (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [19:43:30] okay, thanks for the update [19:43:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) @Andrew and @Jclark-ctr - I met with our Dell account rep today, to try and push for a new replacement server.... [19:43:56] that's a hold-a-ruler-up-to-it linear regression, so ymmv :) [19:43:59] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` rdb2007.codfw.wmnet ` The log can be found in `/var/l... [19:44:04] <_joe_> shdubsh et al, can something be done about logstash? [19:44:06] longma: rzl: greg-g: it would be possible to proceed with the train despite logstash problems if you used logspam-watch on mwlog1001, but it would take extra caution for sure [19:45:29] _joe_: to what end? unblocking the train? 
[19:45:53] <_joe_> to the end of being able to see what logs is mediawiki emitting now [19:46:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:46:13] <_joe_> see these alerts are all spurious now [19:46:45] (03PS1) 10Gilles: Add monitoring for ASO ranking report [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) [19:47:00] We could pipe all logs to devnull until the queue is empty [19:47:11] _joe_: hm.. are they not backdated based on original timestamp from kafka reception? [19:47:28] I'd expect a big gap not a spike [19:47:48] <_joe_> Krinkle: yes, but you would still get them counted now [19:47:51] through kibana there seems to be a gap indeed [19:47:51] (03CR) 10jerkins-bot: [V: 04-1] Add monitoring for ASO ranking report [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [19:47:58] I'd rather have them in the system than tossed [19:48:03] _joe_: interestingly while https://grafana.wikimedia.org/d/000000377/host-overview?panelId=3&fullscreen&orgId=1&refresh=5m&var-server=kubernetes1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&from=now-1h&to=now is at full iowait, https://grafana.wikimedia.org/d/000000377/host-overview?panelId=6&fullscreen&orgId=1&refresh=5m&var-server=kubernetes1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kub [19:48:03] ernetes&from=now-1h&to=now is not ... [19:48:06] _joe_: ah, I guess the prometheus metrics measure logstash intake? 
[19:48:31] Krinkle: logstash counts messages matching a filter and generates statsd metrics as they come across the pipeline [19:48:33] it's not polling logstash for "what happened now-1m" and capturing the count [19:48:42] right, okay [19:48:52] so the timestamps don't correlate to the timestamps in logstash itself once processed [19:48:56] whatever it was, it stopped on kubernetes1001 [19:49:01] correct [19:49:06] <_joe_> akosiaris: so I was tracking it back [19:49:06] which gives me an idea [19:49:22] <_joe_> and it started around may 28, and nothing checks out [19:49:23] discussed with greg-g and decided it might be better to run the train now, use logspam-watch, and coordinate with rzl and cdanis to make sure everything goes smoothly. Does that sound acceptable? [19:49:45] <_joe_> the only thing we can do is track which process is doing iowait [19:50:20] sorry I'm late, I went to dinner and didn't realize my phone died [19:50:26] rzl: longma: greg-g: I am okay with it, but we should be more willing than usual to revert the deploy, since we have much less visibility than usual for debugging [19:50:28] anything I can help with? [19:50:42] volans: make logstash shovel faster [19:50:46] lol [19:51:09] cdanis: anything "doable" :D [19:51:15] (03PS2) 10Gilles: Add monitoring for ASO ranking report [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) [19:51:16] if it's because cp went to k8s, it might be because it's waiting on response from parsoid [19:51:28] longma: yeah, agreed with cdanis -- fine by me, but you're driving with most of the windshield covered, so be ready to brake :D [19:51:45] (that would explain the iowait) [19:52:00] Amir1: iowait is only for disk, not for sockets [19:52:07] oh thanks [19:52:25] <_joe_> akosiaris: I'm chasing down this iowait but it jumps from server to server [19:52:30] _joe_: yes that! [19:52:38] _joe_: are our k8s nodes running buster?
[19:52:39] <_joe_> akosiaris: I think it can wait tomorrow morning when we're fresher though? [19:52:42] I mean the moment I think I got it in iotop [19:52:43] ... [19:52:44] <_joe_> cdanis: nope [19:52:49] ah :( [19:52:58] I would have tried some eBPF tricks if they were [19:53:01] <_joe_> cdanis: next Q I think [19:53:08] upd-localhost-err logstash partitions 4 and 5 look like they have stopped declining, annoyingly enough [19:53:28] <_joe_> cdanis: it's mostly matter of me and akosiaris coordinating efforts probably [19:53:47] <_joe_> because you can find out which process is waiting for io [19:54:01] and as I say that there is the tiniest of dips, naturally... [19:54:03] _joe_: it's not a pod that gets restarted btw [19:54:13] <_joe_> akosiaris: I know, I checked already [19:54:15] as in rescheduled ... [19:54:22] <_joe_> there is nothing that has enough restarts [19:54:38] so what? something with an internal leader ? [19:54:45] only changeprop has that, right ? [19:54:47] <_joe_> changeprop [19:54:50] <_joe_> uhmmm [19:55:02] <_joe_> it would be strange though [19:55:07] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:55:13] okay cdanis rzl I am going to start the deploy [19:56:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:56:11] in a pinch can also look at logstash-next which is group logstash-7-* on the lag graph and is catching up much faster [19:57:07] (03PS1) 10Jeena Huneidi: all wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604852 [19:57:09] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604852 (owner: 10Jeena Huneidi) [19:57:15] I will go into lurk mode as things seem 
quiet-ish (11 pm here, the usual) [19:58:00] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:58:02] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604852 (owner: 10Jeena Huneidi) [19:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:06] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` rdb2008.codfw.wmnet ` The log can be found in `/var/l... [19:58:37] okay I need some water, then I'm going to tidy up the doc a bit [19:59:07] I'll keep monitoring for the rest of the day just to make sure logstash catches up, since that's our only remaining thing, but otherwise I'm considering the incident resolved [19:59:37] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.36 [19:59:39] thanks all <3 action items and incident report to follow [19:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:02] (ftr, longma, thcipriani, brennen and I are in a hangout doing the deploy looking at logs etc) [20:00:02] rzl: I'm opening a placeholder IR just so I can link it in another bug report [20:00:09] ack, sgtm [20:00:18] drop me the link in a PM if you don't mind [20:00:19] (this was our normal weekly watercooler meeting turned deploy ;) ) [20:00:23] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:00:23] greg-g: if there's a way to get logspam-watch to focus on the MW canaries that'd be awesome [20:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:34] yo, I am here if changeprop or jobqueue stuff needs looking at [20:01:09] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install rdb200[78] - 
https://phabricator.wikimedia.org/T251626 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['rdb2007.codfw.wmnet'] ` [20:01:42] cdanis: not easily right now it sounds like [20:02:04] jouncebot: next [20:02:04] In 2 hour(s) and 57 minute(s): Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T2300) [20:02:16] hnowlan: nothing urgent, I think, just some vague suspicion about it [20:02:39] rzl: do you think 'sessionstore' or 'kubernetes' is a more appropriate shortname? [20:02:50] sessionstore [20:03:18] I'm still behind "sessionstore/kubernetes" but if I have to pick one, "sessionstore" [20:03:22] ack [20:04:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:07:36] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 4 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Krinkle) [20:08:29] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 4 others: GenerateFancyCaptchas.php crashes with Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Krinkle) [20:08:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:08:51] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 4 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) [20:09:30] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 4 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) [20:09:49] Amir1: are you going to be awake for the backport windows this evening? [20:09:59] /s// [20:10:05] yup [20:10:14] Amir1: fancy doing https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/604845/ ? :D [20:10:29] Sure thing, have you added it to the deployment? [20:10:54] not yet, I was going to fish for someone to deploy it fist, and looks like I got a bite! [20:11:19] logs look normal from what we can tell in logspam-watch and logstash-next [20:11:51] Amir1: it worked last night too, I just pasted a patch in here, said it would be amazing if it magically got deployed, and in the morning it was! 
:D [20:12:20] 10Operations, 10Scap: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10CDanis) [20:12:52] :) ^ [20:13:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:04] 10Operations, 10Scap, 10Wikimedia-Incident: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10greg) [20:15:33] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:17] 10Operations, 10Scap, 10Sustainability (Incident Prevention): scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10greg) [20:16:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:16:51] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` rdb2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[20:18:20] Amir1: thanks for adding it to the cal <3 [20:18:32] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h& [20:18:32] r-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [20:18:44] Thank you for fixing the bug, I'm just clicking some buttons :P [20:19:00] Amir1: amusingly, jakob wrote all the code xD I'm just coordinating and reviewing [20:19:23] haha [20:19:36] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2008.codfw.wmnet'] ` and were **ALL** successful. [20:19:36] Amir1: when deploying it would be worth checking one of the links i put in the ticket (diff page) that should exception, and also https://commons.wikimedia.org/wiki/Special:Contributions/Addshore should start formatting links correctly again in summaries [20:20:03] cool [20:20:05] noted [20:20:10] <3 o/ [20:21:33] o/ [20:22:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:25:03] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920 (10bd808) [20:27:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:29:49] hrm, varnish shows a 503 spike that dropped down to normal pretty quickly. Nothing obvious in logs on mwlog1001 [20:29:59] er grafana rather [20:30:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:30:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 58 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:31:51] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:28] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:56] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:36:31] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) [20:37:14] (03PS1) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [20:38:28] (03CR) 10jerkins-bot: [V: 04-1] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [20:38:31] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2007.codfw.wmnet'] ` and were **ALL** successful. [20:38:38] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) 05Open→03Resolved @akosiaris node ready for service. 
[20:40:40] (03PS1) 10Bstorm: labstore: fix the failover process [puppet] - 10https://gerrit.wikimedia.org/r/604857 (https://phabricator.wikimedia.org/T224582) [20:41:05] (03PS2) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [20:41:07] (03CR) 10QChris: [C: 03+1] gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [20:42:19] (03CR) 10jerkins-bot: [V: 04-1] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [20:42:31] (03PS3) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [20:43:44] (03CR) 10jerkins-bot: [V: 04-1] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [20:46:04] (03PS4) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [20:47:14] (03CR) 10jerkins-bot: [V: 04-1] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [20:48:49] (03PS5) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [20:50:00] (03CR) 10jerkins-bot: [V: 04-1] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [20:52:43] PROBLEM - MediaWiki exceptions and fatals per minute on 
icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:54:55] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:02:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:05:56] 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Sustainability (Incident Prevention), 10User-brennen: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10brennen) [21:07:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:10:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:17:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:18:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:22:53] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:23:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:25:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:26:23] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:26:43] (03CR) 10BryanDavis: [C: 03+1] labstore: fix the failover process [puppet] - 10https://gerrit.wikimedia.org/r/604857 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [21:36:19] (03CR) 10Bstorm: [C: 03+2] labstore: fix the failover process [puppet] - 10https://gerrit.wikimedia.org/r/604857 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [21:42:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:46:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:56:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:58:33] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [21:58:45] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:59:47] FYI: logstash is now just about caught up, according to https://grafana.wikimedia.org/d/000000561/logstash?panelId=21&fullscreen&orgId=1&refresh=5m&from=now-1h&to=now [22:00:05] Thanks rzl [22:04:18] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (KFrancis) @jbond I am confirming the NDA is complete for Andrew's access request. Thanks! [22:15:37] Operations, Data-Services, cloud-services-team (Kanban): Undo special tools-home and tools-project share definitions for NFS - https://phabricator.wikimedia.org/T161834 (Bstorm) Open→Declined Since modifying this would be a big mess right now, and we want to refresh these servers with a tota... 
[22:18:54] Operations, Data-Services, Tracking-Neverending, cloud-services-team (Kanban): overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083 (Bstorm) [22:22:08] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (Nuria) let's wait to confirm the nature of internship to see whether actual access to data is needed. [22:24:29] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew) > @Andrew - are you ok if I forward them the kernel dump from P10788? Thanks, Willy That's definitely fine! [22:28:42] (CR) DannyS712: [C: +1] "LGTM" [extensions/Echo] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604848 (https://phabricator.wikimedia.org/T254699) (owner: Krinkle) [22:30:56] (PS6) Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [22:33:44] (PS1) Mholloway: Remove reference to no-longer-existing maps-beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/604888 [22:34:51] (CR) Mholloway: [C: +2] Remove reference to no-longer-existing maps-beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/604888 (owner: Mholloway) [22:35:39] (Merged) jenkins-bot: Remove reference to no-longer-existing maps-beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/604888 (owner: Mholloway) [22:42:55] Operations, Data-Services, Tracking-Neverending, cloud-services-team (Kanban): overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083 (Bstorm) Open→Resolved a:Bstorm Only two of the subtasks are still open and on the workboard. 
They are very important and will... [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T2300). [23:00:04] andyrussg and Amir1: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:36] o/ [23:00:38] Puppet, Wikimedia Meet, Patch-For-Review: Puppetize the meet account manager - https://phabricator.wikimedia.org/T251034 (Ladsgroup) Open→Resolved a:Dzahn Except the secrets, this is done. The secrets seem to be not easy to do in cloud. We will try to do it once we are in prod [23:03:26] (PS2) Awight: Migrate QuickSurveys `layout` param [mediawiki-config] - https://gerrit.wikimedia.org/r/604681 (https://phabricator.wikimedia.org/T255130) [23:03:28] (PS1) Awight: Remove deprecated QuickSurveys config fields [mediawiki-config] - https://gerrit.wikimedia.org/r/604895 [23:08:52] I guess I deploy then [23:08:58] (CR) Ladsgroup: [C: +2] Fix entity id lookup for interwiki special page links [extensions/Wikibase] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604845 (https://phabricator.wikimedia.org/T255078) (owner: Addshore) [23:16:41] Amir1: Want to undeploy an extension? :P [23:17:22] right now? 
[23:17:51] yeah [23:18:11] Always appreciate removing old code from production [23:18:19] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/604778/ [23:18:31] just waiting for ejegg to confirm it's ok to go from prod right now [23:19:56] cool, let me know and I press the red button [23:21:19] (PS2) Reedy: Remove ContributionTracking extension [mediawiki-config] - https://gerrit.wikimedia.org/r/604778 (owner: Ejegg) [23:21:19] yup it is [23:21:33] (CR) Reedy: [C: +2] Remove ContributionTracking extension [mediawiki-config] - https://gerrit.wikimedia.org/r/604778 (owner: Ejegg) [23:22:22] (Merged) jenkins-bot: Remove ContributionTracking extension [mediawiki-config] - https://gerrit.wikimedia.org/r/604778 (owner: Ejegg) [23:22:31] Puppet, Toolforge, Goal, cloud-services-team (Kanban): Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (bd808) [23:23:42] Amir1: good to go, or I can ;D [23:23:50] I do it, don't worry [23:24:12] I'm thinking of the way it can be deployed separately without breaking everything [23:24:25] Deploy CommonSettings first [23:24:38] Operations, ops-codfw, DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (Papaul) ` [edit interfaces interface-range vlan-public1-c-codfw] member xe-2/0/14 { ... } + member ge-3/0/8; [edit interfaces interface-range disabled] - member ge-3/0/8;... 
[23:24:45] That stops it being loaded and the $wgUseContributionTracking being used [23:26:32] (Merged) jenkins-bot: Fix entity id lookup for interwiki special page links [extensions/Wikibase] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604845 (https://phabricator.wikimedia.org/T255078) (owner: Addshore) [23:27:23] Puppet, Cloud-Services, Toolforge: Puppetize adding new node to OGE - https://phabricator.wikimedia.org/T88712 (bd808) Open→Resolved a:Bstorm @Bstorm did this as part of the Debian migration (to the extent that it is easily automated) [23:27:25] Puppet, Toolforge, Goal, cloud-services-team (Kanban): Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (bd808) [23:27:41] heh i didn't realise Wikibase merges are shown in here [23:27:48] paladox: anything to wmf/ is now [23:27:53] oh [23:28:03] Puppet, Cloud-Services, Toolforge: Puppetize adding a host to a particular queue - https://phabricator.wikimedia.org/T88713 (bd808) Open→Resolved a:Bstorm Another thing fixed (mostly) in the Debian migration [23:28:04] cause it's kinda relevant :) [23:28:06] Puppet, Toolforge, Goal, cloud-services-team (Kanban): Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (bd808) [23:30:53] need a bit, my yubikey is not cooperating [23:30:55] ugh [23:31:10] let me know if you need me to do it [23:32:09] just got it fixed. It was the ssh agent not running [23:32:24] Operations, ops-codfw, DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (Papaul) [23:33:42] o/ [23:33:58] Reedy: oooops I missed the slot? Hope not!!!! [23:34:05] jouncebot: now [23:34:06] For the next 0 hour(s) and 25 minute(s): Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200611T2300) [23:34:13] AndyRussG: Which? :P [23:34:24] Reedy: The back port slot for CN? 
[23:34:38] Nope [23:34:42] Ah ok phew [23:34:44] Reedy: https://gerrit.wikimedia.org/r/c/552121/ [23:34:44] heh [23:34:51] it's just the table create [23:34:53] Amir is doing the backports :) [23:34:57] Ahhh oops [23:35:01] Amir1: o/ [23:35:01] oh, that's easy [23:35:06] yeee [23:35:09] I can do that simultaneously [23:35:18] ahhhhh ok fantastic thanks so so much!!!! [23:35:37] AndyRussG: o/ [23:35:50] Reedy: Amir1: thanks!!!!! [23:36:41] The alter table was done already [23:36:52] Reedy: Does the undeploy have a ticket? [23:37:07] About to deploy the common settings [23:37:15] !log create cn_notice_regions on metawiki and testwiki T252596 [23:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:19] T252596: CentralNotice: Deploy new regional geotargeting and banner template features - https://phabricator.wikimedia.org/T252596 [23:37:33] https://phabricator.wikimedia.org/T255216 kinda [23:38:52] !log ladsgroup@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:604778|Remove ContributionTracking extension]] (T255216), Part I (duration: 00m 59s) [23:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:56] T255216: Archive the ContributionTracking extension - https://phabricator.wikimedia.org/T255216 [23:39:19] Puppet, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808) I found this lurking in the deep backlog: {T115177}. Still not a bad idea. [23:40:31] Reedy: ah wheee that was fast! 
[23:41:36] stuff like that is easy to do [23:42:03] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:604778|Remove ContributionTracking extension]] (T255216), Part II (duration: 00m 58s) [23:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:34] !log ladsgroup@deploy1001 Synchronized wmf-config/extension-list: [[gerrit:604778|Remove ContributionTracking extension]] (T255216), Part III (duration: 00m 57s) [23:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:49] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 117.3 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [23:45:07] Reedy: thx!!!! [23:50:45] the wikibase backport worked fine on mwdebug1002, syncing [23:51:14] !log ladsgroup@deploy1001 scap failed: average error rate on 3/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details) [23:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:18] ugh, I know why it's happening it's due to the files mismatching on arrival [23:53:03] yup [23:53:07] let me force it [23:54:12] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/Wikibase: [[gerrit:604845|Fix entity id lookup for interwiki special page links (T255078)]] (duration: 00m 38s) [23:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:15] T255078: Fatal "Managed to lookup EntityId but got an unexpected type for namespace" from Wikibase EntityLinkTargetEntityIdLookup - https://phabricator.wikimedia.org/T255078 [23:54:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:55:22] monitoring, it'll cool down [23:55:29] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (Bstorm) Nfsiostat is not only useless (it is a crash loop and nothing else), it causes crashes on nfs clients at this point. We have rem... [23:56:05] (PS1) Papaul: DNS: Add DNS entries for alert2001 [dns] - https://gerrit.wikimedia.org/r/604927 [23:56:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:57:01] okay then ^_^ [23:58:55] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 53 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:59:59] (CR) Papaul: [C: +2] DNS: Add DNS entries for alert2001 [dns] - https://gerrit.wikimedia.org/r/604927 (owner: Papaul)