[00:00:07] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Krinkle) @elukey The mentioned TKO behaviour says that it immediately takes the backend out... [00:04:04] twentyafterfour: James and I had patches in swat as well :) [00:07:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [00:09:31] (03CR) 10Dzahn: [C: 04-1] dumps: monitor generation nfs server hosts for nfsd cpu usage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [00:10:18] (03CR) 10Dzahn: [C: 04-1] dumps: monitor generation nfs server hosts for nfsd cpu usage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [00:11:59] (03CR) 10Dzahn: [C: 04-1] "just remove that "=" from the commandline, then it should work. see inline comments. all else looks good to me. maybe compile it too just " [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [00:18:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [00:25:12] Krinkle: sorry I didn't see you respond earlier. Still around? 
[00:26:18] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts: ``` ['mwmaint1002.eqiad.wmnet'] ``` The log can be found in `/var/log/... [00:30:47] 10Operations, 10Operations-Software-Development: wmf-auto-reimage tries to remove from Debmonitor even with --new - https://phabricator.wikimedia.org/T204789 (10Dzahn) [00:30:55] twentyafterfour: I am [00:31:43] 10Operations, 10Operations-Software-Development: wmf-auto-reimage tries to remove from Debmonitor even with --new - https://phabricator.wikimedia.org/T204789 (10Dzahn) p:05Triage>03Lowest [00:35:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [00:36:44] 10Operations, 10SRE-Access-Requests: Requesting access to researchers, statistics-privatedata,users, analytics-privatedata-users for groceryheist - https://phabricator.wikimedia.org/T204790 (10Groceryheist) [00:41:20] ok I'll swat if you're still up for it [00:43:55] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mwmaint1002.eqiad.wmnet'] ``` and were **ALL** successful. [00:44:42] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @Krinkle The idea that I have is that each mw host (reporting memcache errors/except... [00:45:26] Krinkle: is it possible to test the js on mwdebug? [00:45:31] Yep [00:45:34] checking now [00:45:40] or not yet? 
[00:45:40] cool, it's not sync'd yet one sec [00:45:43] k :) [00:46:27] (03PS1) 10Dzahn: add mwmaint1002 to site.pp with spare role [puppet] - 10https://gerrit.wikimedia.org/r/461261 (https://phabricator.wikimedia.org/T201343) [00:47:34] (03PS2) 10Dzahn: add mwmaint1002 to site.pp with spare role [puppet] - 10https://gerrit.wikimedia.org/r/461261 (https://phabricator.wikimedia.org/T201343) [00:47:36] (03PS3) 10Dzahn: add mwmaint1002 to site.pp with spare role and IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/461261 (https://phabricator.wikimedia.org/T201343) [00:47:38] (03CR) 10Dzahn: [C: 032] add mwmaint1002 to site.pp with spare role and IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/461261 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [00:48:13] still waiting on zuul. estimate 13 minutes :( [00:50:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:51:30] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:55:41] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 132.1 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [01:02:28] 10Operations, 10SRE-Access-Requests: Requesting access to researchers, statistics-privatedata,users, analytics-privatedata-users for groceryheist - https://phabricator.wikimedia.org/T204790 (10Tbayer) Seconding this request: @Groceryheist will be working with @ovasileva and myself on this project under a W... [01:11:37] Krinkle: it's all merged and pulled to mwdebug [01:12:09] James_F: your patch is also merged to mwdebug [01:14:48] mw2001/2017? 
[01:16:14] I'm not seeing the change on mwdebug2001/mw2017 [01:16:21] could be caching though [01:19:48] also checked 2002 and 1002 [01:19:56] not seeing it. Hm.. [01:22:08] Krinkle: let me see [01:22:43] try now (mwdebug2001) [01:25:32] twentyafterfour: got it, and verified the updated code. works as expected. [01:26:55] cool thanks, I'll sync now [01:28:38] To see James' change, would require l10n update btw [01:28:46] I can verify it, but will need scap [01:28:53] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.22/resources/src/mediawiki.util.js: SWAT: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/461227/ (duration: 01m 00s) [01:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:03] is there a way to test l10n changes on debug? [01:30:27] I'm not sure [01:30:52] I think we need to do a full scap [01:33:21] !log twentyafterfour@deploy1001 Started scap: SWAT: full sync to update l10n for https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/461230/ [01:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:04] Would be nice if there was a way to do a l10n rebuild locally as well. E.g. on deploy host and then scap pull, or directly on the mwdebug server. [01:36:15] The command might exist actually, not sure. [01:39:26] there is a command to rebuild l10n but I don't know if it works [01:39:37] definitely needs some testing [01:59:15] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:00:16] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:25:46] PROBLEM - Apache HTTP on mw2136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:45] RECOVERY - Apache HTTP on mw2136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.096 second response time [02:31:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:33:26] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:34:42] !log twentyafterfour@deploy1001 Finished scap: SWAT: full sync to update l10n for https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/461230/ (duration: 61m 20s) [02:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:35] (03CR) 10Mathew.onipe: Elasticsearch module is coming up. (0330 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:03:34] (03CR) 10Krinkle: [C: 031] Allow wikitech bureaucrats to promote to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461240 (owner: 10Gergő Tisza) [03:04:27] (03PS29) 10Mathew.onipe: Add elasticsearch_cluster module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [03:28:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 894.12 seconds [03:29:56] (03PS1) 10Mathew.onipe: Added force shard allocation to elasticsearch_cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [03:31:26] (03CR) 10Mathew.onipe: "New CR" [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:42:33] 10Operations, 10CodeEditor, 10Wikimedia-production-error: Exec error "Possibly missing executable file: svn diff" from Special:Code - https://phabricator.wikimedia.org/T204801 (10Krinkle) [03:44:12] 10Operations, 10CodeEditor, 10Wikimedia-production-error: Exec error "Possibly missing executable file: svn diff" from Special:Code - https://phabricator.wikimedia.org/T204801 (10Krinkle) Looks like the latest server re-provisioning (Debian Stretch?) lost `svn`. Most diffs are still cached so breakage obviou... 
[03:51:55] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 278.51 seconds [04:03:08] 10Operations, 10MediaWiki-extensions-CodeReview, 10Wikimedia-production-error: Exec error "Possibly missing executable file: svn diff" from Special:Code - https://phabricator.wikimedia.org/T204801 (10Legoktm) [04:10:06] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:10:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:13:25] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:14:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:07:25] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) >>! In T202764#4595562, @Smalyshev wrote: > Looking at logstash: https://logstash.wikimedia.org/goto/39a6fe9ed... 
[05:08:23] (03PS1) 10Marostegui: db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461270 (https://phabricator.wikimedia.org/T153638) [05:11:45] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461270 (https://phabricator.wikimedia.org/T153638) (owner: 10Marostegui) [05:13:42] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461270 (https://phabricator.wikimedia.org/T153638) (owner: 10Marostegui) [05:16:21] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2057 (duration: 02m 13s) [05:16:27] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461271 [05:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:16] !log Drop echo tables from s7:kowiki - T153638 [05:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:23] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [05:21:39] (03CR) 10jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461270 (https://phabricator.wikimedia.org/T153638) (owner: 10Marostegui) [05:22:16] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461271 (owner: 10Marostegui) [05:23:56] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461271 (owner: 10Marostegui) [05:25:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2057 (duration: 00m 58s) [05:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:57] (03PS1) 10Marostegui: db-codfw.php: Depool db2074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461272 [05:28:46] (03CR) 10Marostegui: [C: 032] 
db-codfw.php: Depool db2074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461272 (owner: 10Marostegui) [05:30:25] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461272 (owner: 10Marostegui) [05:31:30] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461273 [05:31:50] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2074 (duration: 00m 57s) [05:31:52] !log Drop echo tables from s3 db2074 - T153638 [05:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:03] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [05:32:58] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461273 (owner: 10Marostegui) [05:34:40] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461273 (owner: 10Marostegui) [05:36:18] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461271 (owner: 10Marostegui) [05:36:20] (03CR) 10jenkins-bot: db-codfw.php: Depool db2074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461272 (owner: 10Marostegui) [05:36:22] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461273 (owner: 10Marostegui) [05:37:29] (03PS1) 10Marostegui: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461274 [05:37:36] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2074 (duration: 00m 57s) [05:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:53] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2050 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/461274 (owner: 10Marostegui) [05:40:14] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461274 (owner: 10Marostegui) [05:41:48] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2050 (duration: 00m 57s) [05:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:20] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461275 [05:43:08] (03PS12) 10ArielGlenn: dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) [05:46:09] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461275 (owner: 10Marostegui) [05:47:49] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461275 (owner: 10Marostegui) [05:48:47] (03PS1) 10Marostegui: db-codfw.php: Depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461276 [05:49:20] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2050 (duration: 00m 58s) [05:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:18] (03CR) 10jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461274 (owner: 10Marostegui) [05:50:20] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461275 (owner: 10Marostegui) [05:50:29] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461276 (owner: 10Marostegui) [05:52:10] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461276 (owner: 10Marostegui) [05:52:30] 
(03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461277 [05:53:20] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2036 (duration: 00m 57s) [05:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:17] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461277 (owner: 10Marostegui) [05:59:06] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461277 (owner: 10Marostegui) [06:00:42] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2036 (duration: 00m 58s) [06:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:21] (03CR) 10jenkins-bot: db-codfw.php: Depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461276 (owner: 10Marostegui) [06:04:23] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461277 (owner: 10Marostegui) [06:08:02] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968) [06:12:17] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: expand the includes in sites in main.conf (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:22:40] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: expand the includes in sites in main.conf (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:22:51] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (2/2) [puppet] - 
10https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968) [06:28:26] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-puppet-agent] [06:29:16] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_SHA2_High_Assurance_Server_CA.crt] [06:30:16] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: enable HHVM on some sites(!!!) [puppet] - 10https://gerrit.wikimedia.org/r/452325 [06:30:46] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity] [06:31:05] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:31:06] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [06:31:15] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:32:15] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:33:12] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: enable HHVM on some sites(!!!) 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452325 (owner: 10Giuseppe Lavagetto) [06:33:36] ^Internal Server Error [06:34:29] works on retry, so probably temp puppet master failure [06:34:36] or network error [06:36:06] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:40:08] <_joe_> jynus: or it's 6:30 utc and logrotate hits [06:42:19] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Pine) The problem with spoofed email addresses is happening again, this time on the Education mailing list. https://lists.wikimedia.org/pipermail/education/2018-September/00208... [06:44:35] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:46:20] memcache errors spiked a lot for 4 minutes [06:46:22] (03CR) 10Gehel: Add elasticsearch_cluster module. 
(035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [06:46:45] (03PS2) 10Muehlenhoff: Enable ferm for role::analytics_cluster::hadoop::client [puppet] - 10https://gerrit.wikimedia.org/r/461143 [06:48:55] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:50:59] (03PS4) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 [06:51:13] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 (owner: 10Jcrespo) [06:51:50] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 [06:52:20] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 (owner: 10Jcrespo) [06:52:36] (03PS2) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 [06:52:39] (03PS3) 10Muehlenhoff: Enable ferm for role::analytics_cluster::hadoop::client [puppet] - 10https://gerrit.wikimedia.org/r/461143 [06:52:47] (03CR) 10jenkins-bot: [V: 04-1] Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 (owner: 10Jcrespo) [06:54:54] (03PS3) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 [06:56:06] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:56:26] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [06:56:28] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2041 with low load (duration: 00m 54s) [06:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:36] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:36] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:56] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 57.92 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:58:55] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:45] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:00:59] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 (owner: 10Jcrespo) [07:03:16] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 71.5 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:03:56] (03PS4) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 [07:03:58] (03PS1) 10Jcrespo: mariadb: Depool db1109 for upgrade and reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461286 [07:06:04] (03CR) 10Muehlenhoff: [C: 032] Enable ferm for role::analytics_cluster::hadoop::client [puppet] - 10https://gerrit.wikimedia.org/r/461143 (owner: 10Muehlenhoff) [07:06:53] (03CR) 10Gehel: [C: 04-1] "Mostly minor comments inline." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [07:08:53] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1109 for upgrade and reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461286 (owner: 10Jcrespo) [07:08:59] (03PS2) 10Jcrespo: mariadb: Depool db1109 for upgrade and reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461286 [07:10:48] !log restart db1109 for upgrade [07:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:22] !log Disable puppet on databases to test new alerts - T200509 https://phabricator.wikimedia.org/T172489 [07:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:32] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [07:12:43] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 (duration: 00m 58s) [07:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:41] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1109 for upgrade and reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461287 [07:14:49] (03CR) 10jenkins-bot: mariadb: Depool db1109 for upgrade and reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461286 (owner: 10Jcrespo) [07:16:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10hashar) That works, thank you. >>! In T201470#4593179, @mark wrote: > There are no concerns about adding the conti... 
[07:20:35] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Depool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461288 (https://phabricator.wikimedia.org/T200509) [07:21:20] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1109 for upgrade and reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461287 (owner: 10Jcrespo) [07:21:37] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Depool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461288 (https://phabricator.wikimedia.org/T200509) [07:21:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:22:26] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1109 for upgrade and reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461287 (owner: 10Jcrespo) [07:23:06] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10MoritzMuehlenhoff) >>! In T201470#4597090, @hashar wrote: > For Jenkins, #releng and Moritz receive the security em... [07:23:24] (03CR) 10Jcrespo: [C: 031] db-eqiad,db-codfw.php: Depool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461288 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [07:24:53] jynus: I will wait for your depool [07:24:58] (03PS3) 10Marostegui: db-eqiad,db-codfw.php: Depool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461288 (https://phabricator.wikimedia.org/T200509) [07:25:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:25:34] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 (duration: 00m 57s) [07:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:42] ready [07:25:49] \o/ [07:26:24] !log reimaging mw2245 (spare host) to test reimages from cumin2001 (router policies have been updated) [07:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:53] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Depool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461288 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [07:27:57] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Depool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461288 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [07:29:25] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1109 for upgrade and reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461287 (owner: 10Jcrespo) [07:29:27] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Depool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461288 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [07:30:20] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2084:3315 db2053 db1110 db1096:3316 db1110 (duration: 00m 57s) [07:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:24] (03PS10) 10Marostegui: mariadb: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [07:31:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db2084:3315 db2053 db1110 db1096:3316 db1110 (duration: 00m 57s) [07:31:31] jynus: I am ready for the merges [07:31:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:33] (03CR) 10Jcrespo: "if with "mysql maintenance client" you mean a place from which running mysql production maintenance, it needs more changes than these:" [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [07:32:19] (03PS11) 10Marostegui: mariadb: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [07:33:41] (03PS8) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [07:35:08] jynus: Do I merge first? [07:35:20] any order works [07:35:27] (03CR) 10Marostegui: [C: 032] mariadb: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [07:35:59] jynus: mine is done - I will wait for yours to run puppet on icinga [07:36:41] (03PS9) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [07:37:33] (03CR) 10Jcrespo: [C: 032] mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [07:38:11] where are you testing first [07:38:34] I will use eqiad hosts, db1110 and db1096:3316 [07:39:02] Running puppet on icinga now [07:40:33] db1110 and db1096 got your changes (I just ran puppet on them) [07:41:46] (03CR) 10Muehlenhoff: "As discussed on IRC, the requisites were merged as" [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [07:42:19] was icinga updated? 
[07:42:38] it is now being done (puppet takes a while there) [07:42:50] sure, just asking :-) [07:44:32] (03CR) 10Jcrespo: [C: 031] "Although I would wait to deploy the grant change to have both servers." [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [07:45:15] I see the new check now [07:45:18] jynus: Going to start the test with db1096 [07:45:32] waiting for it to run [07:45:51] !log Stop MySQL on db1096:3316 for alert testing - T200509 [07:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:00] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [07:46:06] we should test an eqiad master next [07:46:20] after that^ [07:46:31] sure [07:47:29] db1096:3316 stopped, waiting for the check to catch it [07:47:47] "Version 10.1.36-MariaDB, Uptime 593339s, read_only: True, 31.17 QPS, connection latency: 0.005213s, query latency: 0.000840s" [07:48:43] PROBLEM - MariaDB Slave SQL: s6 on db1096 is CRITICAL: CRITICAL slave_sql_state could not connect [07:48:52] PROBLEM - MariaDB Slave IO: s6 on db1096 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:57] could not connect? [07:49:03] mysql is stopped [07:49:14] PROBLEM - MariaDB read only s6 on db1096 is CRITICAL: Could not connect to localhost:3316 [07:49:32] was that planned? [07:49:35] yes [07:49:38] that is the test :) [07:49:38] ah, ok [07:49:52] ok, I am done with db1096 no pages, that is good [07:49:53] well, I thought you were just going to stop replication [07:49:53] starting mysql [07:50:00] :-) [07:50:02] no, replication test was done last week :) [07:50:10] you didn't trust my read only check, eh? [07:50:15] :-D [07:50:27] RECOVERY - MariaDB read only s6 on db1096 is OK: Version 10.1.35-MariaDB, Uptime 17s, read_only: True, 10.26 QPS, connection latency: 0.003355s, query latency: 0.002803s [07:50:33] jynus: your turn to test! 
[07:50:46] RECOVERY - MariaDB Slave SQL: s6 on db1096 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:50:56] RECOVERY - MariaDB Slave IO: s6 on db1096 is OK: OK slave_io_state Slave_IO_Running: Yes [07:51:21] honestly, I don't want to set any server in read write [07:51:39] I know the check itself works well, it was just the icinga deployment [07:51:54] I just want to test the check gets installed well in all types of server [07:52:01] hehe [07:52:05] So I can proceed with db1110? [07:52:08] so a metadata master [07:52:12] es [07:52:17] etc. [07:52:17] (03CR) 10Filippo Giunchedi: [C: 031] Allow mgmt hosts to send syslog to central syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/461170 (owner: 10Ayounsi) [07:52:29] we didn't deploy it to parsercaches [07:52:41] not sure if worth it, as those are always readwrite [07:52:49] I don't think it is worth for now, no [07:52:57] I will proceed with db1110 [07:52:59] and when not, it is very obvious [07:53:06] sure [07:53:29] !log Stop MySQL on db1110 for alert testing - T200509 [07:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:36] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [07:53:44] I will enable puppet on enwiki master eqiad [07:53:49] ok! [07:56:01] PROBLEM - MariaDB Slave SQL: s5 on db1110 is CRITICAL: CRITICAL slave_sql_state could not connect [07:56:03] PROBLEM - MariaDB Slave IO: s5 on db1110 is CRITICAL: CRITICAL slave_io_state could not connect [07:56:22] PROBLEM - MariaDB read only s5 on db1110 is CRITICAL: Could not connect to localhost:3306 [07:56:34] marostegui: this is you right? 
^^^ [07:56:41] yep, see SAL above [07:58:05] BTW, I am not 100% sure if "localhost:3306" is 100% clear [07:58:31] but it is what the connect string looks like, even if it uses a socket in reality [08:00:48] PROBLEM - mysqld processes on db1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [08:00:58] ^ Good - I am done with db1110 - starting mysql [08:01:48] RECOVERY - mysqld processes on db1110 is OK: PROCS OK: 1 process with command name mysqld [08:02:04] RECOVERY - MariaDB Slave SQL: s5 on db1110 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:02:14] RECOVERY - MariaDB Slave IO: s5 on db1110 is OK: OK slave_io_state Slave_IO_Running: Yes [08:02:21] db1110 processes paged [08:02:24] Yeah :( [08:02:27] was it supposed to page? (just checking alarms works as expected) [08:02:29] That should not have paged [08:02:31] :( [08:03:05] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459764/11/modules/role/manifests/mariadb/core.pp [08:03:09] let's see what is wrong there [08:03:37] Ah, I think I know [08:04:10] it should be: is_critical => $::$replication_is_critical, or whatever name we want [08:04:19] copy paste fail! :) [08:04:29] not $::$ [08:04:34] I got the pages at 11:02 and 11:03 (problem and recovery together), meh [08:04:48] !log bounced ferm service on elastic2004 [08:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:28] <_joe_> $replication_is_critical, it's in the local scope [08:05:38] also that [08:06:22] rename it $is_critical [08:06:30] <_joe_> $:: is for variables in the top scope [08:06:36] <_joe_> why? 
[08:06:45] because it is not only for replication [08:06:50] it is misleading [08:07:08] and add the right contact group like with monitor_replication [08:07:26] also $is_critical was used for multiinstance hosts, so we have the same name on both files [08:07:44] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459764/11/modules/profile/manifests/mariadb/core/multiinstance.pp [08:09:43] (03PS1) 10Marostegui: core.pp: Fix monitor_process and monitor_disk pages [puppet] - 10https://gerrit.wikimedia.org/r/461351 (https://phabricator.wikimedia.org/T200509) [08:10:17] (03CR) 10jerkins-bot: [V: 04-1] core.pp: Fix monitor_process and monitor_disk pages [puppet] - 10https://gerrit.wikimedia.org/r/461351 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:10:30] uh? [08:10:48] ha! WARNING top-scope variable being used without an explicit namespace (variable_scope) [08:10:51] :) [08:11:19] which one? [08:11:36] process_count => $num_instances, line 99 [08:12:25] but that should be just 1? 
[08:12:33] yeah only that [08:12:54] maybe it is the default [08:13:45] RECOVERY - MariaDB read only s5 on db1110 is OK: Version 10.1.36-MariaDB, Uptime 755s, read_only: True, 55.51 QPS, connection latency: 0.003246s, query latency: 0.000832s [08:13:45] (03PS2) 10Marostegui: core.pp: Fix monitor_process and monitor_disk pages [puppet] - 10https://gerrit.wikimedia.org/r/461351 (https://phabricator.wikimedia.org/T200509) [08:13:45] RECOVERY - Check systemd state on elastic2004 is OK: OK - running: The system is fully operational [08:15:35] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2004 is OK: OK ferm input default policy is set [08:16:41] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/12495/" [puppet] - 10https://gerrit.wikimedia.org/r/461351 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:16:50] jynus: I am going to merge that, it now looks good [08:16:59] let me recheck [08:17:03] I missed it first time [08:17:05] yeah :) [08:17:11] I was waiting for your check there [08:17:43] $::num_instances is wrong? [08:18:16] why? [08:18:33] where is that variable defined? 
[08:19:44] mmm, right, it is defined for multiinstance only yeah [08:21:00] We could actually remove it and just get the default from: modules/mariadb/manifests/monitor_process.pp [08:21:14] For core.pp will always be one [08:21:36] the default is 1 [08:21:39] just remove it [08:21:43] exactly, that is what I am saying [08:21:57] we will have to put all that monitoring code on its own profile [08:22:02] later [08:22:11] yeah, will make things easier [08:22:25] (03PS3) 10Marostegui: core.pp: Fix monitor_process and monitor_disk pages [puppet] - 10https://gerrit.wikimedia.org/r/461351 (https://phabricator.wikimedia.org/T200509) [08:22:29] basically making profile::mariadb::monitor more flexible [08:22:42] and more simple to use [08:25:01] I think that should do, do you want to do a compilation just for sanity (rebuild the one you did before?) [08:25:13] yeah [08:26:09] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/12496/db1110.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/461351 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:26:36] checking [08:26:50] contact_group => dba ? [08:27:00] Check line 77 [08:28:14] so it is ignored and does the right thing, but maybe misleading? 
[08:28:38] or it is ok for now [08:29:01] yeah, we can tackle it if we refactor all this [08:29:08] ok [08:29:14] then looks good to me [08:29:22] let's merge and cross fingers :) [08:29:45] well, we don't cross fingers, we just test it so we don't have to do that :-) [08:29:53] (03CR) 10Marostegui: [C: 032] core.pp: Fix monitor_process and monitor_disk pages [puppet] - 10https://gerrit.wikimedia.org/r/461351 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:31:30] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Aklapper) @Pine: Please provide the [[ https://en.wikipedia.org/wiki/en:Email#Message_header | message headers ]] of one of these messages - unfortunately the archive at https:/... [08:35:20] !log Stop MySQL on db1110 for alert testing - T200509 [08:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:28] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [08:36:14] PROBLEM - Filesystem available is greater than filesystem size on ms-be2043 is CRITICAL: cluster=swift device=/dev/sdf1 fstype=xfs instance=ms-be2043:9100 job=node mountpoint=/srv/swift-storage/sdf1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [08:36:25] ^ godog [08:37:34] PROBLEM - MariaDB Slave IO: s5 on db1110 is CRITICAL: CRITICAL slave_io_state could not connect [08:38:05] PROBLEM - MariaDB read only s5 on db1110 is CRITICAL: Could not connect to localhost:3306 [08:38:11] moritzm: thanks! 
I'll take a look shortly [08:38:25] PROBLEM - MariaDB Slave SQL: s5 on db1110 is CRITICAL: CRITICAL slave_sql_state could not connect [08:38:55] PROBLEM - mysqld processes on db1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [08:39:04] starting mysql again [08:39:55] RECOVERY - mysqld processes on db1110 is OK: PROCS OK: 1 process with command name mysqld [08:39:58] !log rebooting mc1019 for kernel security update [08:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:05] RECOVERY - MariaDB read only s5 on db1110 is OK: Version 10.1.36-MariaDB, Uptime 25s, read_only: True, 13.84 QPS, connection latency: 0.003246s, query latency: 0.003337s [08:40:31] is the disk space check paging still everywhere? [08:41:01] I believe so, I don't think that has been changed [08:41:02] you mean the one on the Swift servers? that one is not paging [08:41:26] marostegui: actually no, it is now only sending it to the irc for inactive hosts [08:42:48] let's run puppet on all eqiad db hosts, leave it like that for 1 hour and so , and then run it in codfw [08:43:06] moritzm: I meant dbs [08:43:07] One sec, I noticed something wrong [08:43:11] ? [08:43:15] Sending the patch now [08:45:55] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Niker... [08:46:29] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Niker... 
[08:46:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:47:16] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10jcres... [08:48:39] (03PS1) 10Marostegui: multiinstance.pp: Fix contact_group for multiinstance slaves [puppet] - 10https://gerrit.wikimedia.org/r/461353 [08:48:42] jynus: ^ [08:50:26] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Pgine... [08:50:39] if you do that, shouldn't you do it also for the above check? [08:50:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:50:58] jynus: for monitor disk? [08:51:14] so it is equal everywhere, every check every role? [08:51:31] same for is_critical? [08:51:36] RECOVERY - MariaDB Slave IO: s5 on db1110 is OK: OK slave_io_state Slave_IO_Running: Yes [08:51:48] Sure [08:52:34] in the end, they should behave exactly the same, except for number of processes, right? [08:52:40] yeah, makes sense [08:53:13] that is why I suggested to encapsulate the same code- it avoids changing the same thing multiple times [08:53:18] and minimizes mistakes [08:53:26] (03CR) 10Volans: "Quick reply to the 2 "open" comments." 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [08:53:57] yeah, it needs refactoring [08:54:22] (03PS2) 10Marostegui: multiinstance.pp: Fix contact_group for multiinstance slaves [puppet] - 10https://gerrit.wikimedia.org/r/461353 [08:54:48] RECOVERY - MariaDB Slave SQL: s5 on db1110 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:59:19] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/12497/" [puppet] - 10https://gerrit.wikimedia.org/r/461353 (owner: 10Marostegui) [08:59:27] let me check instance module too [09:01:00] (03CR) 10Jcrespo: [C: 031] multiinstance.pp: Fix contact_group for multiinstance slaves [puppet] - 10https://gerrit.wikimedia.org/r/461353 (owner: 10Marostegui) [09:01:29] (03CR) 10Marostegui: [C: 032] multiinstance.pp: Fix contact_group for multiinstance slaves [puppet] - 10https://gerrit.wikimedia.org/r/461353 (owner: 10Marostegui) [09:02:25] !log updating intel-microcode on Debian jessie/stretch to 3.20180807a.1 [09:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:42] (03CR) 10Alexandros Kosiaris: [C: 031] setup.py: add missing fields [software/spicerack] - 10https://gerrit.wikimedia.org/r/460425 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:02:56] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:03:36] (03CR) 10Alexandros Kosiaris: [C: 031] tests: improve prospector tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/460426 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:04:02] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: regio-migrate: support upper case FQDNs in instance names [puppet] - 10https://gerrit.wikimedia.org/r/461354 
(https://phabricator.wikimedia.org/T204745) [09:04:39] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: regio-migrate: support upper case FQDNs in instance names [puppet] - 10https://gerrit.wikimedia.org/r/461354 (https://phabricator.wikimedia.org/T204745) (owner: 10Arturo Borrero Gonzalez) [09:04:47] !log rebooting mc1020 for kernel security update [09:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:56] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:05:14] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: regio-migrate: support upper case FQDNs in instance names [puppet] - 10https://gerrit.wikimedia.org/r/461354 (https://phabricator.wikimedia.org/T204745) [09:05:14] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] cloudvps: regio-migrate: support upper case FQDNs in instance names [puppet] - 10https://gerrit.wikimedia.org/r/461354 (https://phabricator.wikimedia.org/T204745) (owner: 10Arturo Borrero Gonzalez) [09:05:50] (03CR) 10Gehel: Add elasticsearch_cluster module. 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [09:05:58] (03CR) 10Faidon Liambotis: [C: 031] mx: remove local ip dns lookup and wiki-mail.wikimedia.org default [puppet] - 10https://gerrit.wikimedia.org/r/461193 (owner: 10Herron) [09:06:37] (03CR) 10Volans: [C: 032] setup.py: add missing fields [software/spicerack] - 10https://gerrit.wikimedia.org/r/460425 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:07:56] (03Merged) 10jenkins-bot: setup.py: add missing fields [software/spicerack] - 10https://gerrit.wikimedia.org/r/460425 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:08:17] (03CR) 10Volans: [C: 032] tests: improve prospector tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/460426 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:08:31] (03CR) 10Faidon Liambotis: [C: 04-1] "I don't understand how this would work, I may be missing something :) These files are on the agent, but this manifest runs on the master, " [puppet] - 10https://gerrit.wikimedia.org/r/461197 (owner: 10Dzahn) [09:09:36] (03Merged) 10jenkins-bot: tests: improve prospector tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/460426 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:10:47] (03CR) 10Alexandros Kosiaris: "I am also wondering what this will be used for... :(" [puppet] - 10https://gerrit.wikimedia.org/r/461197 (owner: 10Dzahn) [09:11:10] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:11:55] akosiaris: context is https://gerrit.wikimedia.org/r/399972 [09:12:38] yeah I saw that and still don't understand [09:12:57] what do we need the tor fingerprint for ? 
[09:13:35] we run multiple instances of the tor relay (separate instances of the daemon with a separate config file each) [09:14:05] !log rebooting mc1021 for kernel security update [09:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:11] when you do something like that, you're supposed to list all of the fingerprints of those instances in a setting called "family" in each of them [09:14:40] family ? lol [09:14:42] that's so that the network knows that they're operated by the same entity (or in the same box) and you shouldn't use more than one of them in a circuit [09:14:50] aaah, that makes sense [09:14:56] ok I was missing that part [09:15:23] right now we're hardcoding the fingerprints in our config, which isn't great [09:15:26] I 'll add this IRC conv in the gerrit change for posterity's sake and for others to know that [09:15:32] sure :) [09:15:44] the approach is flawed however [09:15:47] I think I mentioned something like that earlier in the gerrit comments [09:15:59] the one in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461197/1/modules/tor/manifests/fingerprint.pp I mean [09:16:18] yeah I don't get how that would work tbh, but maybe I'm missing something :) [09:16:19] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:16:37] it wouldn't [09:17:39] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10mobrovac) >>! In T186748#4549704, @pmiazga wrote: > I think (I didn't verify it yet) that it fails because of introduced restbase checks: Th... [09:19:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Answered by faidon, pasting from IRC for posterity's sake." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461197 (owner: 10Dzahn) [09:20:00] 10Operations, 10Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849 (10fgiunchedi) >>! In T114849#4584805, @EBernhardson wrote: > On the other hand I believe @fgiunchedi is considering increasing the ingestion and disk capacity of the logstash infrastru... [09:20:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:21:09] PROBLEM - IPsec on mc2021 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1021_v4 [09:25:35] is it just me or do memcached errors keep spiking [09:25:51] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) @mobrovac we're on 1.7.0, Puppeteer 1.8.0 got released like a week ago. Today I'll create a patch to update puppeteer to the latest... [09:26:43] !log repair sdf on ms-be2043 - T199198 [09:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:50] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [09:27:49] addshore: caused by the mc1019 reboot, it seems the boot order is incorrect, currently looking via serial console [09:33:23] <_joe_> addshore: the errors happen when we reboot a server, as we can't broadcast SETs so mcrouter reports the error AIUI [09:33:53] gotcha [09:35:05] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Repool db2084, db2053 and db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461356 [09:36:14] RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [09:36:38] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Repool db2084, db2053 and db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461356 (owner: 10Marostegui) [09:37:47] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Repool db2084, db2053 and db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461356 (owner: 10Marostegui) [09:38:05] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.50 seconds [09:39:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db2084:3315 db2053 db1096:3316 (duration: 00m 59s) [09:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:46] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2084:3315 db2053 db1096:3316 (duration: 00m 58s) [09:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:20] !log Upgrade MariaDB and kernel on db1110 [09:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:13] lol^I thought you had that already [09:42:22] hehe no, not yet :) [09:42:35] I saw you removed it from the paste [09:42:36] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: horizon: mediawiki-vagrant project is now in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/461357 (https://phabricator.wikimedia.org/T204745) [09:42:45] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Repool db2084, db2053 and db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461356 (owner: 10Marostegui) [09:43:37] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: horizon: mediawiki-vagrant project is now in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/461357 (https://phabricator.wikimedia.org/T204745) (owner: 10Arturo Borrero Gonzalez) [09:46:21] (03PS1) 10Muehlenhoff: Remove mc1021 from mcrouter 
[puppet] - 10https://gerrit.wikimedia.org/r/461358 [09:46:33] (03CR) 10Alexandros Kosiaris: [C: 031] dnsdisc: improve TTL checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:47:08] (03PS4) 10Volans: dnsdisc: improve TTL checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) [09:48:25] PROBLEM - MariaDB read only x1 on db2033 is CRITICAL: Could not connect to localhost:3306 [09:48:37] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12498/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/461358 (owner: 10Muehlenhoff) [09:50:14] (03PS2) 10Muehlenhoff: Remove mc1021 from mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/461358 [09:50:23] jynus: db2033? [09:50:31] (03CR) 10Alexandros Kosiaris: [C: 031] dnsdisc: catch dnspython exceptions (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459805 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:51:04] it is up for me [09:51:20] (03CR) 10Muehlenhoff: [C: 032] Remove mc1021 from mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/461358 (owner: 10Muehlenhoff) [09:54:06] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 (owner: 10Jcrespo) [09:54:25] checking [09:55:15] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 (owner: 10Jcrespo) [09:55:28] (03PS1) 10Marostegui: db-eqiad.php: Repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461359 [09:56:23] strange, it works if I execute locally [09:56:27] as defined on icinga [09:56:56] oh, it fails with the alerting user [09:57:03] so probably grant issue [09:57:10] marostegui: ^ [09:57:16] would make sense yeah [09:57:18] (03CR) 
10jenkins-bot: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461284 (owner: 10Jcrespo) [09:57:20] great because it shows an actual problem [09:58:01] (03PS1) 10Banyek: db-eqiad.php: Set API servers weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) [09:59:18] (03CR) 10Marostegui: "Can you provide some examples on how you found out that the query plan is now the same? What did you do? How did you find it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [10:00:49] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Set API servers weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [10:00:51] (03CR) 10Banyek: "Yes, after this patch is out, I close the original ticket, and write down all my findings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [10:01:36] (03CR) 10Marostegui: "> Yes, after this patch is out, I close the original ticket, and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [10:03:11] (03CR) 10Banyek: [C: 032] db-eqiad.php: Set API servers weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [10:03:50] !log fixing db2033 grants [10:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:00] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:09] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: 
/{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:13] (03CR) 10Volans: [C: 032] dnsdisc: improve TTL checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:04:20] jynus: your change for db2041 wasn't deployed, no? Maybe banyek can do it while depooling his? [10:04:20] RECOVERY - MariaDB read only x1 on db2033 is OK: Version 10.1.32-MariaDB, Uptime 13912555s, read_only: True, 772.02 QPS, connection latency: 0.005532s, query latency: 0.000858s [10:04:20] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:38] (03Merged) 10jenkins-bot: db-eqiad.php: Set API servers weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [10:04:39] I think he wanted to take a break [10:04:40] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:41] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:41] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:41] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: 
/{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:42] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:50] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:04:52] mobrovac: ^ [10:04:57] jynus: I mean just the scap [10:04:57] _joe_: ^^^ [10:05:00] has the page changed? [10:05:19] which is the test page? [10:05:30] swagger? 
[10:05:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "some inline comments" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:05:41] (03Merged) 10jenkins-bot: dnsdisc: improve TTL checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:05:59] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:06:07] (03PS2) 10Marostegui: db-eqiad.php: Repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461359 [10:06:35] jynus: db1066 also on icinga about the read only, I guess grants too for that one [10:07:12] db2033 got fixed after the grant change [10:07:17] banyek: will you push jaime's changes or do you want me to do with once I merge mine? [10:07:23] checking db1066 [10:07:26] 10Operations, 10ops-eqiad: mc1021 boot failure - https://phabricator.wikimedia.org/T204812 (10MoritzMuehlenhoff) [10:07:31] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T203565: Weight adjust for S1 API hosts (duration: 00m 57s) [10:07:33] looking [10:07:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461359 (owner: 10Marostegui) [10:07:35] My scap is alredy on way [10:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:38] :( [10:07:38] T203565: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 [10:07:47] yeah it's a swagger check [10:07:54] banyek: yeah, will you scap for db-codfw.php for jaime's changes or I you prefer me to do that? 
[10:08:07] or I can do [10:08:10] PROBLEM - MariaDB read only s2 on db1066 is CRITICAL: Could not connect to localhost:3306 [10:08:18] I can do it [10:08:34] banyek: you just need to do the same but for db-codfw.php with this change: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/461284/ [10:08:43] So maybe the message can be: "Fully pool db2041 back into production." [10:08:52] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461359 (owner: 10Marostegui) [10:08:54] 👍 [10:08:59] mobrovac: where is the swagger config? [10:09:00] has something else beside db pools been deployed? [10:09:21] it seems mw is not setting the language field any more in the response apparently [10:09:29] ouch [10:09:37] volans: can you clarify your quesiton? :P [10:10:17] <_joe_> mobrovac: right now? [10:10:39] <_joe_> or is this happening since some time? [10:10:39] mobrovac: can wait, let's fix this first [10:10:41] apache? shouldn't that be an application change? [10:10:42] i'm investigating, but that's what the failure message of the check suggests [10:10:49] _joe_: the alarms went off a couple of minutes ago [10:11:00] more like 5~6 [10:11:22] <_joe_> volans: on which vhosts? [10:11:47] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: Fully pool db2041 back into production (duration: 00m 57s) [10:11:52] \o/ [10:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:54] <_joe_> also, do we know which query is done to mediawiki? 
[10:11:55] (03CR) 10jenkins-bot: db-eqiad.php: Set API servers weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461360 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [10:11:57] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461359 (owner: 10Marostegui) [10:11:58] dunno yet, the check-mobileapps doesn't say it [10:12:09] PROBLEM - HTTP releases-jenkins.wikimedia.org on releases2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.073 second response time [10:12:16] that's what I was asking for, where is the swagger spec [10:12:19] RECOVERY - MariaDB read only s2 on db1066 is OK: Version 10.1.33-MariaDB, Uptime 8905712s, read_only: True, 109.79 QPS, connection latency: 0.002900s, query latency: 0.000913s [10:12:25] btw this is the media end point, so not as alarming [10:12:25] $ /usr/local/bin/check-mobileapps [10:12:28] let me look into it [10:12:30] /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:12:47] <_joe_> volans: yeah I can read icinga [10:13:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1110 (duration: 00m 57s) [10:13:12] <_joe_> content-language not set is a huge issue and can have something to do with apache's config [10:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:20] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None [10:13:34] <_joe_> but I honestly doubt it would've appeared 2 hours after I did my changes [10:13:37] <_joe_> 3 hours even [10:13:59] PROBLEM - Host mc1021 is DOWN: PING CRITICAL - Packet loss = 100% [10:14:16] ^ silencing [10:14:23] 
https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/spec.yaml#L658-L707 [10:14:37] thanks mdholloway [10:14:49] ACKNOWLEDGEMENT - Host mc1021 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T204812 [10:14:55] content-language is just passed through from whatever is received with the page html via v1/page/html/{title} [10:15:02] <_joe_> mdholloway: yeah my question is what's the query you do at the mediawiki api [10:15:07] <_joe_> oh so it's restbase [10:15:10] yep [10:15:11] ACKNOWLEDGEMENT - IPsec on mc2021 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1021_v4 Muehlenhoff T204812 [10:15:14] <_joe_> you get it from there [10:15:18] that's right [10:15:20] <_joe_> one more level of inception [10:16:27] so to be clear- there is a missing respose header, but things are relatively up, right? [10:16:37] this is not a fatal outage? [10:16:46] jynus: that's right [10:16:59] ok, now that is clear, debugging can happen [10:17:10] hmm, the restbase public API is returning content-language. [10:17:37] <_joe_> mdholloway: what url are you testing? [10:17:47] (03PS2) 10Volans: dnsdisc: catch dnspython exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/459805 (https://phabricator.wikimedia.org/T199079) [10:18:11] <_joe_> jynus: we're trying to understand if it's a symptom of a big outage [10:18:17] ok [10:18:22] looking at rb [10:18:26] i just tested https://en.wikipedia.org/wiki/User:BSitzmann_%28WMF%29/MCS/Test/Frankenstein (same as the spec x-ample caes) [10:18:28] *case [10:18:39] <_joe_> mdholloway: that's not restbase, that's the page on the wiki [10:18:39] PROBLEM - MariaDB read only x1 on db2069 is CRITICAL: Could not connect to localhost:3306 [10:19:14] <_joe_> I tried https://en.wikipedia.org/api/rest_v1/page/media/User:BSitzmann_(WMF)/MCS/Test/Frankenstein [10:19:24] <_joe_> but I guess I missed the public prefix for mcs [10:19:30] d'oh, thanks. 
[10:19:34] yes, that's the URL we want [10:20:00] <_joe_> it's page/mobile/ [10:20:02] <_joe_> I think [10:20:40] RECOVERY - MariaDB read only x1 on db2069 is OK: Version 10.1.35-MariaDB, Uptime 3538308s, read_only: True, 455.29 QPS, connection latency: 0.006286s, query latency: 0.000847s [10:20:53] <_joe_> and I still get a 404 anyways [10:21:01] rb wasn't returning the header [10:21:07] it is now [10:21:19] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) >>! In T203169#4595824, @herron wrote: >>>! In T203169#4594186, @fgiunchedi wrote: >> If for some reason we're under-provisioning (or over-utilizing)... [10:21:21] <_joe_> what was the issue then? [10:21:41] hm the mobileapps checks are still failing [10:22:41] (03CR) 10Volans: "Thanks for the review, replies inline" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:22:50] <_joe_> so, let's proceed with order [10:23:17] no wait, rb is still not returning it [10:23:18] wth [10:23:36] (03CR) 10Volans: [C: 032] dnsdisc: catch dnspython exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/459805 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:23:46] hmmmmmm [10:24:01] a no-cache request returns the accept-lang, but a normal req doesn't [10:24:05] that is veery fishy [10:24:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:24:36] (03Merged) 10jenkins-bot: dnsdisc: catch dnspython exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/459805 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:24:50] and these two requests return different uuids [10:24:55] uf that's a mess [10:25:10] <_joe_> 
mobrovac: cassandra-level mess? [10:25:32] <_joe_> mobrovac: how do I perform a "no-cache" request to restbase, btw? [10:25:57] _joe_: jsut add -H'Cache-Control: no-cache' when making reqs to rb inside prod [10:26:47] !log Deploy schema change on x1:testwiki - T51593 [10:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:56] T51593: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 [10:27:50] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Banyek) [10:32:17] (03CR) 10Gehel: mediawiki: improve siteinfo checks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:37:49] <_joe_> mobrovac: curl -H 'Cache-control: no-cache' -v 'http://restbase.discovery.wmnet:7231/en.wikipedia.org/v1/page/media/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein/803891963' doesn't return the content-language either [10:38:02] 10Operations, 10Operations-Software-Development: wmf-auto-reimage tries to remove from Debmonitor even with --new - https://phabricator.wikimedia.org/T204789 (10Volans) I agree, let's add an if to avoid it :) [10:38:02] <_joe_> argh [10:38:03] <_joe_> typo [10:38:49] <_joe_> still, I don't see it [10:42:16] that's weird [10:42:32] i'm trying locally on rb2009 and it does with no-cache [10:42:35] lemme try differently [10:42:44] <_joe_> try on another host [10:46:47] _joe_: i get content-language: en when doing it form scb2002 [10:46:57] curl -H'Cache-Control: no-cache' 'http://restbase.discovery.wmnet:7231/en.wikipedia.org/v1/page/html/User%3ABSitzmann_(WMF)%2FMCS%2FTest%2FFrankenstein/803891963' [10:49:49] <_joe_> found the difference [10:49:58] <_joe_> you're asking /html/ [10:50:19] yeha that's what mobileapps is asking of restbase [10:50:29] <_joe_> oh I see [10:50:30] <_joe_> heh [10:52:20] <_joe_> I was 
requesting the mobileapps-powered url [10:53:15] (03PS1) 10Volans: cumin: re-disable the urllib3 warning [puppet] - 10https://gerrit.wikimedia.org/r/461364 (https://phabricator.wikimedia.org/T177385) [10:53:22] <_joe_> content-length: 83791 vs content-length: 83799 is also baffling [10:53:47] i don't understand what changed in the last 30 mins to be causing this [10:54:00] <_joe_> me neither [10:54:22] <_joe_> that revision should be rendered and stored in restbase since a long time [10:54:25] could the cache be hiding the real time when this has changed? [10:54:41] <_joe_> volans: the cache is showing us the wrong result [10:54:51] now [10:54:59] but maybe was invalidated when it alamerd? [10:55:00] <_joe_> and so the moment we saw the problem is the moment the cache changed [10:55:01] *alarmed [10:55:22] the revision was created late last night [10:55:27] ack, what I meant is how long the previous item was there in the cache? [10:55:50] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 160.2 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [10:55:54] <_joe_> volans: we don't know for sure without looking at the logs of change-propagation I guess [10:55:55] _joe_: no, that's not correct, there is no usage of cache in this case [10:55:56] (03CR) 10Volans: "Compiler results available at:" [puppet] - 10https://gerrit.wikimedia.org/r/461364 (https://phabricator.wikimedia.org/T177385) (owner: 10Volans) [10:56:06] <_joe_> mobrovac: uh? 
[10:56:21] mobileapps asks restbase directly for the html [10:56:27] <_joe_> yes [10:56:29] so it's served from cassandra [10:56:34] not from varnish [10:56:35] <_joe_> and restbase has it in its cache [10:56:41] <_joe_> sorry, yes, its storage [10:56:55] <_joe_> I was referring to restbase as a cache :) [10:57:04] <_joe_> from its pov, this is in storage [10:57:17] <_joe_> so what is in storage is wrong [10:57:30] it seems there is something particular about the revision [10:57:32] <_joe_> it also has the anchors all with differend ids [10:57:41] <_joe_> *different [10:57:47] when i ask for the html without the revision and without no-cache, i get the header [10:58:27] <_joe_> was this revision created around the time when a restbase deploy was ongoing? [10:58:31] <_joe_> or a mw deploy [10:58:49] <_joe_> but then, why start alarming 10 minutes ago? [10:59:24] <_joe_> well more like 30 [10:59:34] 60 [10:59:42] but yeah, that's what i'm wondering too [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:41] <_joe_> mobrovac: do we store a timestamp in cassandra? [11:01:00] <_joe_> if so, we should be able to retreive that and know when it was parsed [11:01:36] when i ask rb for the rev info that it's got, it says loud and clear that the page lang is en [11:01:44] for that particular rev [11:01:54] <_joe_> wtf [11:02:01] the rev asked by mcs is 2017-10-05T09:51:24Z [11:02:04] so long time ago [11:02:07] protected by bernd [11:02:14] <_joe_> we need to look at the cassandra level [11:02:30] <_joe_> what's stored for that revid I guess [11:02:39] this might be a timeline index edge case [11:02:48] <_joe_> uh? 
[11:03:12] nm [11:03:45] the point now is that since the only problem here is that header missing, it's not a major problem [11:04:05] <_joe_> I would still like to understand what happened :) [11:04:14] the exact intricacies of what happens will take some time to uncover [11:04:15] sure sure [11:04:26] _joe_: i will silence icinga for now for mcs [11:04:31] <_joe_> but if you feel this is just too much work to be worth it, can't we just ask for a re-render? [11:04:47] <_joe_> that should solve the issue, right? [11:05:36] that's what no-cache is supposed to do [11:05:43] but this seems like a different problem [11:05:51] it's not related to the actual content of the html [11:07:24] on the other hand, the page revisions seem to contain the language, but it doesn't seem to be set for some reason [11:07:31] so that part is to be investigated [11:07:41] but this will not be a quick thing [11:07:48] <_joe_> yeah, I see :P [11:08:20] <_joe_> well I thought that header was stored in cassandra with the rest of the response data [11:08:44] <_joe_> so forcing to refresh all that's in cassandra should be enough to fix the issue; apparently it's not [11:09:41] ACKNOWLEDGEMENT - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:41] ACKNOWLEDGEMENT - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. 
[11:09:41] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:41] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:42] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:42] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:42] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. 
[11:09:42] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:43] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:43] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:43] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. [11:09:44] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header content-language: None Marko Obrovac RB not returning content-language, investigating. - The acknowledgement expires at: 2018-09-20 19:08:32. 
[11:10:03] <_joe_> untick "send notifications" every time :P [11:10:32] ah right [11:10:46] usually i just want to press "send" as soon as possible [11:10:47] :P [11:10:50] sorry about that [11:11:00] <_joe_> nah I forget as well for the same reason [11:11:09] got to go run an errand and lunch, will continue the investigation later [11:11:43] _joe_: ftr, probably the quick fix is to simply clear the sotrage for this title, but before doing that i'd like to understand exactly what is happening [11:12:05] <_joe_> yes [11:12:18] <_joe_> I agree, better not to overlook this [11:12:47] <_joe_> I'd say I'm happy to help with the investigation, but I have very little time :/ [11:32:49] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 6 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) a:03Addshore [11:33:11] (03PS42) 10Gehel: Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [11:33:23] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T1200) [12:00:36] !log reimaging mw1298 (spare host) to test reimages from cumin2001 [12:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:03] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jmm on cumin2001.codfw.wmnet for hosts: ``` ['mw1298.eqiad.wmnet'] 
``` The log can be found in `/var... [12:01:34] ^ volans: phab integration also works fine [12:02:25] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/461364 (https://phabricator.wikimedia.org/T177385) (owner: 10Volans) [12:02:30] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:07:17] !log stopping puppet on all elasticsearch servers to deploy new systemd unit - https://gerrit.wikimedia.org/r/c/operations/puppet/+/440498 [12:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:49] (03CR) 10Gehel: [C: 032] Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:08:51] (03CR) 10Hashar: [C: 031] "Yup that seems good to me, and eventually I would have figured out the log are in /var/log/cumin/cumin.log :] Thank you!" [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [12:09:00] !log rolling restart of relforge for new systemd unit [12:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:29] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:13:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:13:50] so, if `ssh debug1001.eqiad.wmnet` just sits there doing nothing, and not even asks for a pass phrase, what am i doing wrong? 
[12:15:11] .ssh/config is https://pastebin.com/kE4WcS6a [12:15:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:16:23] after about 5 minutes or so, it sais "ssh_exchange_identification: Connection closed by remote host" a few hundret times (!) very quickly, and then quits. [12:16:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:17:09] DanielK_WMDE_, if you run with -vvv what comes out? [12:17:26] DanielK_WMDE_: debug1001.eqiad.wmnet doesn't exist, mwdebug1001 perhaps [12:17:44] (03PS4) 10Hashar: zuul: allow email connection [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) [12:17:48] hm [12:18:07] I feel like there'd be a more helpful error message if that were the only problem [12:18:25] https://pastebin.com/TPq0aZC3 [12:18:28] and hangs there [12:18:36] !log rolling restart of elasticsearch / logstash for new systemd unit [12:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:38] godog: i did try mwdebug1001.eqiad.wmnet, sorry. [12:19:44] see the second paste, it has the command line [12:19:50] (03CR) 10Hashar: [C: 04-1] "Puppet compilation is: https://puppet-compiler.wmflabs.org/compiler1002/12500/" [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) (owner: 10Hashar) [12:20:34] DanielK_WMDE_, so if you go directly to bast1002.wikimedia.org what happens? [12:20:59] Krenair: the same [12:21:38] DanielK_WMDE_: ah indeed, I see it now [12:21:45] DanielK_WMDE_, if you go directly to bast1002 you get a debug entry saying it's proxying via bast1002...? 
[12:21:49] if i had to guess, i'd say it'S a packet filter throwing away my packets. [12:21:59] that'll be an infinite loop, try adding !bast1002.wikimedia.org etc. to the Host line on line 17 [12:22:20] would also explain the hundreds of error messages you get eventually [12:23:34] Krenair: oh, yea, that'S it! [12:23:37] debug1: Executing proxy command: exec ssh -a -W bast1002.wikimedia.org:22 bast1002.wikimedia.org [12:23:57] I kind of wish SSH would detect when people try to do that. Have been bitten by it before myself. [12:23:58] Uh... also, debug2: add_identity_file: ignoring duplicate key /home/daki/.ssh/wmf_rsa [12:24:04] this is a new-ish key. [12:24:08] probably okay? [12:24:41] ...did i overwrite the private key with the public key when copying things around? that would be silly. [12:25:07] possibly [12:25:29] DanielK_WMDE_, so are you in now? [12:26:51] Krenair: no. trying to understand what i have to change in my config [12:27:12] i have ProxyCommand none for host bast1002.wikimedia.org [12:27:25] so how come i'm getting an infinite loop? [12:27:38] it matches before it gets that far [12:27:47] it'll be matching "Host *.wikimedia.org" on line 17 [12:27:49] like I said [12:27:57] just add !bast1002.wikimedia.org to line 17 [12:28:06] yes, but it should also match the other line, no? [12:28:08] should probably do the same for the other bast* hosts [12:28:17] oh, does order matter? it will only match one, the first oen it fields? [12:28:18] I wouldn't count on it [12:28:19] that would explain it. [12:28:25] *it finds [12:28:46] well, the example config on https://wikitech.wikimedia.org/wiki/Production_shell_access only makes sense if order matters [12:29:18] yes, order DOES matter [12:29:26] swapping the order of the entries in my config fixes this. [12:29:27] hm that might be it, could try swapping the order of the *.wikimedia.org/*.wmnet vs. bastion blocks [12:29:28] DOH! 
[12:29:36] ok then [12:29:47] I assumeed it would match all and merge the options, with last-write-wins logic [12:29:57] but it actually uses first-found-winds [12:30:02] *wins [12:30:03] (03PS1) 10Arturo Borrero Gonzalez: New upstream version 0.0.8 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/461375 [12:30:05] (03PS1) 10Arturo Borrero Gonzalez: debian: initial code import (jessie) [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/461376 (https://phabricator.wikimedia.org/T203177) [12:30:09] * DanielK_WMDE_ can't type today [12:30:14] * DanielK_WMDE_ can't really type any day [12:30:49] (03PS1) 10Arturo Borrero Gonzalez: New upstream version 0.0.8 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/461377 [12:30:52] * DanielK_WMDE_ is hopelessly dependent on spell checkers and ids to fix his mistakes [12:31:20] (03PS1) 10Arturo Borrero Gonzalez: pristine-tar data for prometheus-openstack-exporter_0.0.8.orig.tar.gz [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/461378 [12:31:57] * DanielK_WMDE_ is in after mistyping his passphrase three times [12:32:21] anyway, thanks folks :) [12:38:06] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] pristine-tar data for prometheus-openstack-exporter_0.0.8.orig.tar.gz [debs/prometheus-openstack-exporter] (pristine-tar) - 10https://gerrit.wikimedia.org/r/461378 (owner: 10Arturo Borrero Gonzalez) [12:38:26] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1298.eqiad.wmnet'] ``` and were **ALL** successful. 
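Editor's note on the ssh exchange above (12:13-12:32): the loop came from the wildcard `Host *.wikimedia.org` block supplying a ProxyCommand for the bastion itself. ssh_config(5) says that for each parameter the first obtained value is used, so every matching block is consulted, but a later `ProxyCommand none` cannot override an earlier value. The Python sketch below only simulates that lookup rule. The pastebin config is not in the log, so the three layouts, the `%h:%p` form of the proxy command, and the helper names `host_block_matches` and `resolve` are assumptions made for illustration; `fnmatch` stands in for ssh's own pattern matching.

```python
from fnmatch import fnmatch


def host_block_matches(patterns, host):
    """A Host line applies if a positive pattern matches and no negated one does."""
    matched = False
    for pattern in patterns:
        if pattern.startswith("!"):
            if fnmatch(host, pattern[1:]):
                return False  # a matching negated pattern disables the whole block
        elif fnmatch(host, pattern):
            matched = True
    return matched


def resolve(blocks, host):
    """Walk the blocks top to bottom; the first value seen for each option wins."""
    options = {}
    for patterns, settings in blocks:
        if host_block_matches(patterns, host):
            for key, value in settings.items():
                options.setdefault(key, value)  # later values never override
    return options


BASTION = "bast1002.wikimedia.org"
# Hypothetical proxy setting, modelled on the debug output quoted in the log.
proxy = {"ProxyCommand": f"ssh -a -W %h:%p {BASTION}"}

# Layout 1: wildcard block first (the ordering that looped).
wildcard_first = [
    (["*.wikimedia.org", "*.wmnet"], proxy),
    ([BASTION], {"ProxyCommand": "none"}),
]
# Layout 2: bastion block first (the fix DanielK_WMDE_ applied).
bastion_first = list(reversed(wildcard_first))
# Layout 3: negated pattern on the wildcard line (the fix Krenair suggested).
negated = [
    (["*.wikimedia.org", "*.wmnet", "!" + BASTION], proxy),
    ([BASTION], {"ProxyCommand": "none"}),
]

for name, layout in [("wildcard_first", wildcard_first),
                     ("bastion_first", bastion_first),
                     ("negated", negated)]:
    print(name, "->", resolve(layout, BASTION)["ProxyCommand"])
```

In the first layout the bastion resolves to a proxy through itself, which is the loop seen in the log; either reordering the blocks or negating the bastion on the wildcard line yields `none` for the bastion while other hosts keep the proxy.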
[12:43:54] (03PS1) 10Alexandros Kosiaris: Edit Project Config [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461381 [12:44:29] (03PS1) 10Arturo Borrero Gonzalez: Edit Project Config [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461382 [12:45:42] (03PS1) 10Alex Monk: Allow ops to submit in this repo [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461383 [12:46:03] (03PS1) 10Alexandros Kosiaris: Edit Project Config [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461384 [12:46:39] (03PS1) 10Alexandros Kosiaris: Edit Project Config [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461385 [12:46:57] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] Allow ops to submit in this repo [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461383 (owner: 10Alex Monk) [12:47:03] (03PS2) 10Arturo Borrero Gonzalez: Allow ops to submit in this repo [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461383 (owner: 10Alex Monk) [12:47:06] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] Allow ops to submit in this repo [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/461383 (owner: 10Alex Monk) [12:47:43] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] New upstream version 0.0.8 [debs/prometheus-openstack-exporter] (upstream) - 10https://gerrit.wikimedia.org/r/461377 (owner: 10Arturo Borrero Gonzalez) [12:48:03] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] New upstream version 0.0.8 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/461375 (owner: 10Arturo Borrero Gonzalez) [12:48:37] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] debian: initial code import (jessie) [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/461376 
(https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [12:50:00] <_joe_> DanielK_WMDE_: did you fix your ssh issues? I was away and I didn't read the backscroll [12:50:11] <_joe_> oh you're in :P [12:50:13] <_joe_> ok then [12:50:36] !log Deploy schema change on s3 eqiad master on mediawikiwiki (this will generate some lag on s3 eqiad) - T51593 [12:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:44] T51593: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 [12:51:54] (03CR) 10Effie Mouzeli: CLI: improve help message (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [12:56:20] (03CR) 10Volans: CLI: improve help message (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [12:58:01] Krenair: https://wikitech.wikimedia.org/w/index.php?title=Production_shell_access&type=revision&diff=1803438&oldid=1802430 [12:59:56] jouncebot: next [12:59:56] In 0 hour(s) and 0 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T1300) [13:00:04] hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - European version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T1300). 
[13:00:04] :) [13:02:15] (03CR) 10Effie Mouzeli: "Short messages have more changes of being read I reckon, I vote for the short version" [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [13:03:57] !log T203177 add initial prometheus-openstack-exporter package to reprepro (v0.0.8-1) [13:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:05] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [13:08:06] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928 (10jcrespo) [13:15:41] (03PS3) 10Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460492 [13:25:53] !log Deploy schema change on x1 eqiad master on enwiki (this will generate some lag on x1 eqiad) - T51593 [13:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:01] T51593: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 [13:27:40] (03PS8) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) [13:27:42] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert simple wikis in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/452323 (https://phabricator.wikimedia.org/T196968) [13:27:44] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert usability wiki [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968) [13:27:46] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: migrate wikispecies [puppet] - 10https://gerrit.wikimedia.org/r/452636 (https://phabricator.wikimedia.org/T196968) [13:27:48] (03PS1) 10Giuseppe Lavagetto: 
mediawiki::web::prod_sites: reorganize code a bit [puppet] - 10https://gerrit.wikimedia.org/r/461393 [13:27:50] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert foundation.w.o to use vhost [puppet] - 10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) [13:27:52] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert commons.w.o [puppet] - 10https://gerrit.wikimedia.org/r/461395 (https://phabricator.wikimedia.org/T196968) [13:27:54] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert meta.w.o [puppet] - 10https://gerrit.wikimedia.org/r/461396 (https://phabricator.wikimedia.org/T196968) [13:27:56] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikisource.org [puppet] - 10https://gerrit.wikimedia.org/r/461397 (https://phabricator.wikimedia.org/T196968) [13:28:54] (03PS3) 10Mark Bergsma: Don't recalculate server.up in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 [13:36:39] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 58.55 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:37:49] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 72.75 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:39:46] !log Deploy schema change on dbstore2002:x1 [13:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:24] (03CR) 10Mathew.onipe: Add elasticsearch_cluster module. (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:47:04] (03PS30) 10Mathew.onipe: Add elasticsearch_cluster module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [13:48:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:48:11] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:49:58] (03CR) 10Muehlenhoff: [C: 031] mediawiki::web::prod_sites: reorganize code a bit [puppet] - 10https://gerrit.wikimedia.org/r/461393 (owner: 10Giuseppe Lavagetto) [13:50:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:51:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:55:41] (03PS9) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) [14:03:19] (03PS1) 10Jcrespo: section: Order masters at the end [software] - 10https://gerrit.wikimedia.org/r/461406 [14:04:15] (03PS2) 10Jcrespo: section: Order masters at the end [software] - 10https://gerrit.wikimedia.org/r/461406 [14:04:25] marostegui: ^ [14:07:42] (03PS3) 10Jcrespo: section: Order masters at the end [software] - 10https://gerrit.wikimedia.org/r/461406 [14:08:40] (03PS2) 10Giuseppe Lavagetto: mediawiki: Install php-dba in PHP 7 [puppet] - 10https://gerrit.wikimedia.org/r/459882 (owner: 10Legoktm) [14:08:42] (03PS9) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) [14:08:46] <_joe_> legoktm: [14:08:49] <_joe_> ^^ [14:09:04] (03CR) 10Legoktm: [C: 031] mediawiki: Install php-dba in PHP 7 [puppet] - 10https://gerrit.wikimedia.org/r/459882 (owner: 10Legoktm) [14:09:10] ty :)) [14:09:17] <_joe_> gonna merge this now [14:09:25] <_joe_> then I have to go away for an interview [14:10:08] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[initramfs-tools] [14:10:40] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: Install php-dba in PHP 7 [puppet] - 10https://gerrit.wikimedia.org/r/459882 (owner: 10Legoktm) [14:10:45] legoktm: that -dba is not covered in php-defaults seems like a bug, though, we should report it to Debian I guess [14:10:53] <_joe_> indeed [14:11:02] <_joe_> legoktm: also thanks for noticing this [14:11:19] moritzm: yep, I said I was going to file a bug in the other channel :) [14:11:38] legoktm: ok, missed that :-) [14:12:49] <_joe_> also I would report php-sybase being included in php-defaults as a bug [14:12:52] <_joe_> :P [14:13:10] * _joe_ has a long troubled history with Sybase in his past [14:21:09] (03CR) 10Marostegui: [C: 031] section: Order masters at the end [software] - 10https://gerrit.wikimedia.org/r/461406 (owner: 10Jcrespo) [14:25:25] (03PS1) 10Marostegui: db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461408 [14:28:39] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[intel-microcode],Package[php7.0-dba] [14:28:47] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461408 (owner: 10Marostegui) [14:29:07] <_joe_> moritzm: you're upgrading packages right now? [14:29:41] intel-microcode on some eqiad hosts [14:29:58] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461408 (owner: 10Marostegui) [14:30:21] should be done soonish [14:31:02] done [14:32:39] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] is this expected? 
^^^ [14:34:24] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10ovasileva) [14:34:49] (03CR) 10jenkins-bot: db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461408 (owner: 10Marostegui) [14:35:36] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2033 for alter table (duration: 00m 59s) [14:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:06] (03PS1) 10Andrew Bogott: keystone hooks: remove port 5666 from default security groups [puppet] - 10https://gerrit.wikimedia.org/r/461410 [14:36:43] !log Deploy alter table on db2033 (x1) enwiki [14:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:49] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [14:37:49] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.03 ms [14:37:53] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461411 [14:40:38] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:42:05] (03CR) 10Andrew Bogott: [C: 032] keystone hooks: remove port 5666 from default security groups [puppet] - 10https://gerrit.wikimedia.org/r/461410 (owner: 10Andrew Bogott) [14:42:14] (03PS2) 10Herron: mx: remove local ip dns lookup and wiki-mail.wikimedia.org default [puppet] - 10https://gerrit.wikimedia.org/r/461193 [14:42:25] (03CR) 10Herron: [C: 031] Allow mgmt hosts to send syslog to central syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/461170 (owner: 10Ayounsi) [14:43:29] (03CR) 10Herron: [C: 032] mx: remove local ip dns lookup and wiki-mail.wikimedia.org default [puppet] - 
10https://gerrit.wikimedia.org/r/461193 (owner: 10Herron) [14:46:24] !log updating MX bulk_smtp helo_data (gerrit 461193) [14:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:00] (03CR) 10Jcrespo: [C: 032] section: Order masters at the end [software] - 10https://gerrit.wikimedia.org/r/461406 (owner: 10Jcrespo) [14:51:38] volans: re mr1-eqiad.oob v6, not expected, but low risk, thanks for pointing it out, my guess is that Equinix is either doing maintenance or had an outage [14:51:59] XioNoX: ack, thanks for having a look [14:52:26] also the Zayo link between codfw and eqiad is down (unplanned outage), still have plenty of redundancy [14:53:02] (03CR) 10Alexandros Kosiaris: [C: 032] Scap: Update config to use PHP=php7.0 [puppet] - 10https://gerrit.wikimedia.org/r/460021 (https://phabricator.wikimedia.org/T191921) (owner: 10Thcipriani) [14:53:09] (03PS2) 10Alexandros Kosiaris: Scap: Update config to use PHP=php7.0 [puppet] - 10https://gerrit.wikimedia.org/r/460021 (https://phabricator.wikimedia.org/T191921) (owner: 10Thcipriani) [14:53:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Scap: Update config to use PHP=php7.0 [puppet] - 10https://gerrit.wikimedia.org/r/460021 (https://phabricator.wikimedia.org/T191921) (owner: 10Thcipriani) [14:55:17] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461411 (owner: 10Marostegui) [14:56:46] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461411 (owner: 10Marostegui) [14:57:58] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2033 after alter table (duration: 00m 57s) [14:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:09] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:02:00] (03CR) 10Mobrovac: [C: 031] 
RPC/RunSingleJob.php - send X-Readonly header. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) (owner: 10Ppchelko) [15:02:09] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:03:12] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461411 (owner: 10Marostegui) [15:09:54] (03PS2) 10Volans: cumin: re-disable the urllib3 warning [puppet] - 10https://gerrit.wikimedia.org/r/461364 (https://phabricator.wikimedia.org/T177385) [15:10:43] (03CR) 10Volans: [C: 032] cumin: re-disable the urllib3 warning [puppet] - 10https://gerrit.wikimedia.org/r/461364 (https://phabricator.wikimedia.org/T177385) (owner: 10Volans) [15:14:44] (03PS1) 10Jcrespo: mariadb: Reduce db2057 load, add db2036 as rc host, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461418 [15:15:26] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Krinkle) a:03Smalyshev Considering this done from our team, but keeping open for now because the WDQS issues appear unre... 
[15:16:03] (03PS1) 10Bstorm: standalone-puppetmaster: stretch compatibility improvement [puppet] - 10https://gerrit.wikimedia.org/r/461419 [15:16:19] (03PS1) 10Pmiazga: Enable Page issus A/B test for Latvian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461420 (https://phabricator.wikimedia.org/T204609) [15:17:22] (03PS2) 10Pmiazga: Enable Page issuses A/B test for Latvian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461420 (https://phabricator.wikimedia.org/T204609) [15:17:24] (03CR) 10Marostegui: [C: 031] "let's monitor how db2036 copes with the extra traffic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461418 (owner: 10Jcrespo) [15:17:26] (03PS2) 10Volans: CLI: improve help message [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) [15:17:55] (03CR) 10Volans: "> Patch Set 1:" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [15:18:11] (03CR) 10Jcrespo: [C: 032] mariadb: Reduce db2057 load, add db2036 as rc host, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461418 (owner: 10Jcrespo) [15:19:27] (03PS3) 10Pmiazga: Enable Page issuses A/B test for Latvian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461420 (https://phabricator.wikimedia.org/T204609) [15:19:38] (03Merged) 10jenkins-bot: mariadb: Reduce db2057 load, add db2036 as rc host, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461418 (owner: 10Jcrespo) [15:23:33] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Reduce db2057 load (duration: 00m 57s) [15:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:05] !log rebooting cr3-ulsfo for upgrade (not in prod) [15:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:30] (03CR) 10Effie Mouzeli: [C: 032] CLI: improve help message [software/cumin] - 
10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [15:27:18] (03Merged) 10jenkins-bot: CLI: improve help message [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [15:28:57] (03CR) 10jenkins-bot: CLI: improve help message [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) (owner: 10Volans) [15:31:42] (03CR) 10Muehlenhoff: mediawiki::web::prod_sites: convert simple wikis in remnant.conf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/452323 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:32:38] Looking for docs on how to log in as an admin in the consolidated grafana console. [15:32:53] It's not my LDAP password and I can't do a PW reset... [15:33:16] (03CR) 10jenkins-bot: mariadb: Reduce db2057 load, add db2036 as rc host, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461418 (owner: 10Jcrespo) [15:33:42] * awight facepalms. [15:33:44] (03CR) 10Jdlrobson: [C: 031] Enable Page issuses A/B test for Latvian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461420 (https://phabricator.wikimedia.org/T204609) (owner: 10Pmiazga) [15:33:55] OK it was the LDAP password. [15:39:19] !log rebooting cr4-ulsfo for upgrade (not in prod) [15:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:03] (03CR) 10Muehlenhoff: [C: 04-1] "tungsten still uses (but seems kind of superfluous, as it also already uses the xhgui::app role), with that resolved, rest seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [15:51:39] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:52:23] ^ probably me, checking (logstash unit) [15:54:18] ACKNOWLEDGEMENT - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel switching to new systemd unit in progress, cleanup will happen after all nodes are restarted [15:54:18] ACKNOWLEDGEMENT - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel switching to new systemd unit in progress, cleanup will happen after all nodes are restarted [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T1600). [16:00:04] stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:37] hello [16:02:52] The only patches registered for SWAT are mine so I can go ahead and deploy them if there's no objections. [16:04:54] stephanebisson: +2 :) [16:05:22] stephanebisson: do you have some for the EchoForeignWikiRequest undefined indices? [16:05:52] hashar: yes, that's the first one I'll do [16:06:29] oh and the patch might strikes two tasks in one go!! [16:06:51] I am not sure what is the impacts of those though [16:09:19] hashar: FWIW it worked well for me locally [16:10:14] hashar: I'm on the deployment host at `/srv/mediawiki` and git reports that I'm not in a git repo. What am I missing? [16:10:31] stephanebisson: did you mean mw-staging? 
[16:10:46] mediawiki is the actual deployed code [16:11:01] it's /srv/mediawiki-staging [16:11:06] I'm on deployment.eqiad.wmnet [16:11:20] yeah and there's a /srv/mediawiki on there which will not be a git repo [16:11:28] yep [16:11:28] ah, ok [16:11:32] it's a deployed copy of the code like you'll see on the target machines [16:11:34] you shouldn't touch that [16:11:44] and shouldn't be able to touch it anyway [16:12:13] yeah no mucking with /srv/mediawiki without knowing what you're doing :) [16:12:52] yeah, wikidevs shouldn't be able to write there, only read [16:12:57] oh deployers will be able to touch it [16:13:05] mmm [16:13:10] I partially know what I'm doing, that's why I'm asking [16:15:31] deployers can most likely go to the host and then SSH to the same host as mwdeploy and then touch the files, even if they can't touch the files under their own user or via sudo (I think they can do both anyway) [16:19:51] stephanebisson: sorry I am in a meeting. But yeah /srv/mediawiki-staging is where we fetch stuff [16:20:11] that is then deployed to the target hosts under /srv/mediawiki, the deployment server being a targeted host [16:23:34] (03PS2) 10Arlolra: Set $wgSiteMatrixNonGlobalSites global for SiteMatrix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 [16:23:44] hashar: thanks, I was confused for a moment but no worries: I'm proceeding with caution, following my notes and will ask if something doesn't seem right [16:24:26] stephanebisson: when in doubt I refer to https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [16:24:56] the tldr being https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins [16:25:30] /srv/mediawiki-staging/php-1.32.0-wmf.22/ is mediawiki/core with extensions registered as submodules [16:25:55] (03CR) 10Subramanya Sastry: [C: 031] "Note that this is +2ed (by deployers with rights) just before a SWAT deploy. 
So, if this is ready to go, you should schedule it in a swat " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 (owner: 10Arlolra) [16:27:46] It is possible that this is stuck or is it just very slow https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Echo/+/461363/ ? [16:28:30] hm [16:28:35] jenkins-bot set V+2 [16:28:41] normally it will then submit [16:29:08] there is an active test running for it [16:29:15] oh there we go [16:29:19] stephanebisson, ^ [16:29:39] alright [16:29:42] stephanebisson, basically I went to https://integration.wikimedia.org/zuul/ and ctrl+f'd for 461363, it showed that it's actively working on it [16:30:37] it has a progress bar and you can click to expand a block, it should show you which part is having problems so you can go and get more details (if you click the link from there you can then go to Console Output) [16:31:45] (03CR) 10Subramanya Sastry: [C: 031] "https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 (owner: 10Arlolra) [16:31:49] Krenair: thanks [16:31:58] yeah tests take a while, so we usually CR+2 patches ahead of the window [16:32:08] (speeding up the test will happen eventually) [16:32:52] I git pull'd on the deployment host, verified the `git log` and `git submodule update extensions/Echo` (just thinking out loud, ignore me) [16:33:16] pretty sure I used to be able to merge and deploy quite a few patches in-window, but it may be slower these days [16:33:37] I'm going to `scap pull` on mwdebug2001 [16:34:32] It worked quickly without errors but i can't really see what was pulled [16:34:47] I'm going to try to test that change using mw2017 [16:35:08] well theoretically any modifications you made in /srv/mediawiki-staging [16:35:17] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b35727e] (dev-cluster): Bug fix: return the stored headers in the key rev value bucket [16:35:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:43] if you pulled onto mwdebug2001 it'll be on that host, if you want to test on mw2017 instead I think you'll have to go there and scap pull [16:38:18] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b35727e] (dev-cluster): Bug fix: return the stored headers in the key rev value bucket (duration: 03m 01s) [16:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:54] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket [16:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:17] Krenair: I thought mwdebug2001 and mw2017 were the same host. Is it not the case? [16:42:10] uh [16:42:10] stephanebisson: definitely not, they have different hostnames ;) [16:42:17] actually I can't find anything called mw2017 [16:42:32] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket (duration: 03m 38s) [16:42:33] alex@alex-laptop:~$ dig mwdebug2001.codfw.wmnet @ns0.wikimedia.org +short [16:42:33] 10.192.0.98 [16:42:33] alex@alex-laptop:~$ dig mw2017.codfw.wmnet @ns0.wikimedia.org +short [16:42:33] alex@alex-laptop:~$ [16:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:58] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket [16:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:10] Apparently mw2017 got decommissioned [16:43:22] https://phabricator.wikimedia.org/T187467 [16:43:39] RoanKattouw: when we deployed on Monday we `scap pull` on mwdebug2001 and asked people to test against mw2017 using the browser extension... was it even working? 
[16:43:41] like over a month ago [16:44:30] I believe so [16:44:51] I believe mw2017 in the extension is really just an alias for mwdebug2001 [16:47:04] The extension sets `x-wikimedia-debug: backend=mw2017.codfw.wmnet` [16:47:49] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [16:48:41] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket (duration: 05m 43s) [16:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:49] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [16:48:49] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:49:03] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket, take #3 [16:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:58] !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/Echo/includes/api/ApiCrossWiki.php: T204758 (duration: 00m 57s) [16:49:58] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [16:49:59] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:49:59] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [16:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:05] T204758: EchoForeignWikiRequest emits PHP error "Undefined index: csrftoken" - https://phabricator.wikimedia.org/T204758 [16:50:08] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [16:50:08] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [16:51:14] (03PS7) 10Jdlrobson: Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 
(https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [16:52:19] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [16:52:49] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [16:52:51] !log rolling restart of elasticsearch / logstash for new systemd unit completed - new systemd unit is elasticsearch_5@production-logstash-eqiad [16:52:54] godog: ^ [16:52:55] I'm going to stop here for now. The most important patch was deployed (T204758) [16:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:19] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [16:53:27] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [16:54:29] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [16:54:31] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket, take #3 (duration: 05m 28s) [16:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:36] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket, take #4 [16:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:19] stephanebisson, so basically the extension needs to be updated with the new list of hosts? [16:55:44] Krenair: looks like it. Who does that? [16:55:57] (03CR) 10Arlolra: "This doesn't need to be SWAT'ed, just the normal deploy cycle is fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 (owner: 10Arlolra) [16:55:57] that's a good question. 
[16:56:52] (03PS1) 10Dzahn: add IPv6 records for mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/461427 (https://phabricator.wikimedia.org/T201343) [16:56:54] WikimediaDebug [16:56:55] Offered by: chrome-wikimedia-debug [16:57:21] I wonder if Krinkle knows [16:57:37] or bd808 [16:57:49] (as it looks like Bryan is involved in the FF one) [16:58:10] Ori, Bryan and myself have access. [16:58:25] There is an issue on the GitHub repo to make it read debug.json from noc.wikimedia.org/wmf-config [16:58:51] Right now, the Varnish/nginx config for X-Wikimedia-Debug director in operations/puppet maps the old mw* names to mwdebug*. [16:58:57] ah [16:58:58] I *think* the codfw hostnames work, they just aren't the same names that are used to ssh to the instances [16:59:13] Yeah, mw1017>mwdebug1001 (already renamed in the extension) [16:59:14] can't we make it send HTTP 400 in response to a decommissioned one? [16:59:25] mw1099>1002, and 2017/2019>2001/2002. [16:59:33] I've updated the docs on wikitech to explain the current mapping [16:59:41] well, it's not decom'ed [16:59:42] it works fine [17:00:09] PROBLEM - Check systemd state on elastic2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:00:10] it does respond 400 to invalid/unknown values [17:00:46] if we're going to have clients specify internal wmnet host names, surely as soon as that host ceases to be, clients should start seeing errors if they try to use it? [17:00:59] PROBLEM - Check systemd state on elastic1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:08] PROBLEM - Check systemd state on elastic2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:19] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:01:48] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:49] PROBLEM - Check systemd state on elastic1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:59] PROBLEM - Check systemd state on elastic1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:09] rather than getting mapped to something else silently behind the scenes [17:02:28] PROBLEM - Check systemd state on elastic2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:29] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:29] PROBLEM - Check systemd state on elastic2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:45] um is systemd on elastic* supposed to be breaking, gehel? [17:02:59] PROBLEM - Check systemd state on elastic1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:59] PROBLEM - Check systemd state on elastic1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 58.54 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:03:08] PROBLEM - Check systemd state on elastic2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:03:10] actually expected, silencing now (sorry for the noise) [17:03:18] np just worried :) [17:03:28] Krenair: you should be! My bad! 
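The X-Wikimedia-Debug mapping discussed above (old mw* names silently mapped to mwdebug*, HTTP 400 for unknown values, and the suggestion to return 400 for retired names as well) can be sketched roughly as follows. This is only an illustration: the real logic lives in the Varnish/nginx configuration in operations/puppet, and the names `LEGACY_TO_CURRENT`, `resolve_backend` and the `strict` flag are hypothetical. The host pairs are the ones quoted in the conversation.

```python
# Hypothetical sketch of the backend lookup; not the actual Varnish/nginx config.
# Legacy name -> current name, as quoted in the discussion above.
LEGACY_TO_CURRENT = {
    "mw1017": "mwdebug1001",
    "mw1099": "mwdebug1002",
    "mw2017": "mwdebug2001",
    "mw2019": "mwdebug2002",
}
CURRENT_BACKENDS = set(LEGACY_TO_CURRENT.values())


def resolve_backend(name, strict=False):
    """Return (http_status, backend) for a requested debug backend name.

    strict=False: legacy names are silently mapped (the behaviour described
    in the log). strict=True: legacy names are rejected with 400, which is
    the alternative suggested in the discussion.
    """
    if name in CURRENT_BACKENDS:
        return 200, name
    if name in LEGACY_TO_CURRENT and not strict:
        return 200, LEGACY_TO_CURRENT[name]
    # Unknown values (and legacy names in strict mode) are rejected.
    return 400, None
```

With `strict=False` a request for `mw1017` lands on `mwdebug1001`; with `strict=True` the same request is refused, so a stale client list shows up as an error instead of being papered over.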
[17:03:38] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b35727e]: Bug fix: return the stored headers in the key rev value bucket, take #4 (duration: 09m 03s) [17:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:30] _joe_, bearND, mdholloway|afk: the imminent problem with mcs not getting the language should be solved now, icinga says all green [17:04:39] RECOVERY - IPsec on mc2021 is OK: Strongswan OK - 1 ESP OK [17:04:49] PROBLEM - Check systemd state on elastic2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:04:58] RECOVERY - Host mc1021 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:05:06] <_joe_> mobrovac: did you understand what happened there? [17:05:56] _joe_: so far, just the part that caused the lang not to be returned, but there are other bugs we have unearthed there [17:06:08] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:07:01] <_joe_> oh the rabbithole debugging pattern [17:07:18] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:07:38] PROBLEM - Host mc1021 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:39] PROBLEM - Check systemd state on elastic2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:08:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 76.36 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:10:18] RECOVERY - Host mc1021 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [17:12:19] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:12:29] PROBLEM - IPsec on mc2021 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1021_v4 [17:12:44] 10Operations, 10ops-eqiad: mc1021 boot failure - https://phabricator.wikimedia.org/T204812 (10Cmjohnson) @MoritzMuehlenhoff This server was not set to legacy bios. I changed the setting to legacy bios and verified the SATA controller was enabled. I was able to boot to the HDD and the server had no issues goi... [17:12:48] PROBLEM - configured eth on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:12:48] PROBLEM - MD RAID on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:12:58] PROBLEM - Check size of conntrack table on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:12:58] PROBLEM - confd service on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:12:58] PROBLEM - Check systemd state on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:12:58] PROBLEM - DPKG on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:12:59] PROBLEM - Memcached on mc1021 is CRITICAL: connect to address 10.64.0.82 and port 11211: Connection refused [17:13:19] PROBLEM - Check whether ferm is active by checking the default input chain on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:13:19] PROBLEM - Disk space on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:13:19] PROBLEM - dhclient process on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:13:19] PROBLEM - Confd template for /etc/redis/replica/6379-state.conf on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:13:38] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:13:49] Krinkle: https://github.com/wikimedia/WikimediaDebug/pull/18 [17:13:52] mobrovac: thanks! 
[17:14:11] (03PS2) 10Dzahn: add IPv6 records for mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/461427 (https://phabricator.wikimedia.org/T201343) [17:14:18] PROBLEM - puppet last run on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:15:09] PROBLEM - Check health of redis instance on 6379 on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:15:24] (03CR) 10Dzahn: [C: 032] add IPv6 records for mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/461427 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [17:16:07] ah crap,, typo.. i'll merge my fix at the same time [17:16:14] didn't run authdns-update yet [17:16:56] 10Operations, 10ops-eqiad, 10netops: Ensure scs-c1-eqiad:eth1 is not connected - https://phabricator.wikimedia.org/T204743 (10Cmjohnson) 05Open>03Resolved IDK how or recall but I managed to plug in a second ethernet port on the mgmt ports of the scs....removed the eth1 connection and all is good now. Res... [17:17:12] the zone-checker would have caught that, if only had been merged :( [17:17:28] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) Dell kicked it back again saying it's not our system. I will try calling them now [17:17:29] PROBLEM - IPsec on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:17:50] volans: well, i caught it in the diff so i said "no" [17:18:32] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) latest update from HP...they are sending a new cable Hello Chris, Thankyou for uplaoding the AHS logs, below are the findings -------------------------------------------------------------------... 
[17:18:48] (03PS1) 10Dzahn: fix typo for v6 reverse record of mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/461434 [17:18:57] (03CR) 10jerkins-bot: [V: 04-1] fix typo for v6 reverse record of mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/461434 (owner: 10Dzahn) [17:19:21] (03PS2) 10Dzahn: fix typo for v6 reverse record of mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/461434 [17:19:44] volans: zone-checker, you mean in Gerrit or in authdns-update [17:20:15] mutante: it's a tool that is pending review since long time that, once we fix all the errors it found in the DNS repo, could be added to CI [17:20:25] !log powering off mw1254 to reseat DIMM T204491 [17:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:33] T204491: Heating alerts / memory errors on mw1254 - https://phabricator.wikimedia.org/T204491 [17:20:50] volans: ah! i am interested in fixing the errors it finds now [17:20:58] PROBLEM - Host mc1021 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:19] RECOVERY - Host mc1021 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:21:42] (03CR) 10Dzahn: [C: 032] fix typo for v6 reverse record of mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/461434 (owner: 10Dzahn) [17:22:22] mutante: I can add you to the task and update the current output, sure [17:22:32] volans: now i understand "if only had been merged" as well.. 
i was first thinking you mean my change with the typo "if only you had merged it, it would have told you" and was confused :) [17:22:41] volans: ok :) [17:22:48] ahaaha, sorry for the misunderstanding [17:22:48] PROBLEM - Host mw1254 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:59] stephanebisson: thank you :] [17:23:09] I heading out for dinner [17:23:22] will do the train at 19:00 UTC (1h36 minutes from now) [17:23:28] PROBLEM - Host mw1254.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:23:49] PROBLEM - configured eth on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:23:49] PROBLEM - MD RAID on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:23:59] PROBLEM - Check size of conntrack table on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:23:59] PROBLEM - confd service on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:23:59] PROBLEM - DPKG on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:23:59] PROBLEM - Check systemd state on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:24:17] what's up with mc1021? [17:24:18] eh, ok. so mc1021 is being worked on ..i saw above [17:24:25] where? [17:24:26] but that 1254.mgmt one ... [17:24:28] PROBLEM - Disk space on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:24:28] PROBLEM - Check whether ferm is active by checking the default input chain on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:24:28] PROBLEM - Confd template for /etc/redis/replica/6379-state.conf on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:24:28] PROBLEM - dhclient process on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:24:41] it was rebooted this morning according to SAL, that's it [17:25:00] 13:12 <+wikibugs> Operations, ops-eqiad: mc1021 boot failure - https://phabricator.wikimedia.org/T204812 (Cmjohnson) @MoritzMuehlenhoff This server was not set to legacy bios. 
I changed the setting to legacy [17:25:04] this one volans ^ [17:25:27] 10Operations, 10ops-eqiad: Heating alerts / memory errors on mw1254 - https://phabricator.wikimedia.org/T204491 (10Cmjohnson) @MoritzMuehlenhoff I ended up just swapping the DIMM between side A and B....leaving open to see if it helps [17:25:27] ah ok [17:25:28] and i also see an explanation for mw1254 now [17:25:31] so expired downtime I guess [17:25:45] mw1254 yes has been shutdown for maintenance [17:25:58] RECOVERY - Host mw1254 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:26:08] PROBLEM - Check health of redis instance on 6379 on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:26:29] RECOVERY - Host mw1254.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:26:39] PROBLEM - puppet last run on mc1021 is CRITICAL: Return code of 255 is out of bounds [17:26:39] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @jcrespo When you have a minute, I'd like to hear your opinion on calculated field joins, e.g. {P7570} I don'... [17:27:41] sorry about that I disabled checks on mc1021 for moritzm to finish tomorrow. 
[17:27:56] thanks Chris [17:28:25] 10Operations, 10ops-eqiad: mc1021 boot failure - https://phabricator.wikimedia.org/T204812 (10Cmjohnson) @MoritzMuehlenhoff I disabled icinga checks for this host [17:28:59] RECOVERY - Check systemd state on elastic2018 is OK: OK - running: The system is fully operational [17:28:59] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational [17:29:18] PROBLEM - DPKG on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:29:20] notebook1003 seems to have an issue [17:29:28] RECOVERY - Check systemd state on elastic1026 is OK: OK - running: The system is fully operational [17:29:29] PROBLEM - Check systemd state on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:29:39] RECOVERY - Check systemd state on elastic2011 is OK: OK - running: The system is fully operational [17:29:39] PROBLEM - MD RAID on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:29:49] PROBLEM - Disk space on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:29:49] PROBLEM - dhclient process on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:30:20] PROBLEM - configured eth on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:30:45] yea.. eh... all of these things unrelated but in the same timeframe [17:30:59] PROBLEM - puppet last run on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:31:10] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) @fgiunchedi HW diagnostics didn't report any problem. Upgrade all firmware on the server . we can observe the server all this week and if we do have the same problem again, I will contact HP. 
[17:31:32] !log notebook1003 - starting failed nagios-nrpe-server [17:31:33] notebook1003 is not me this time [17:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:40] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [17:31:50] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:32:00] RECOVERY - Disk space on notebook1003 is OK: DISK OK [17:32:00] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [17:32:00] cmjohnson1: yep, i see it's not. the nagios-nrpe-server failed.. so all the NRPE checks fail but others work [17:32:20] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [17:32:29] RECOVERY - DPKG on notebook1003 is OK: All packages OK [17:36:30] PROBLEM - DPKG on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:36:50] PROBLEM - Check systemd state on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:36:50] PROBLEM - SSH on notebook1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:00] PROBLEM - MD RAID on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:37:10] PROBLEM - Disk space on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:37:10] PROBLEM - dhclient process on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:37:30] PROBLEM - configured eth on notebook1003 is CRITICAL: Return code of 255 is out of bounds [17:40:29] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational [17:42:30] PROBLEM - Filesystem available is greater than filesystem size on ms-be2042 is CRITICAL: cluster=swift device=/dev/sdl1 fstype=xfs instance=ms-be2042:9100 job=node mountpoint=/srv/swift-storage/sdl1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops 
[17:49:07] Krenair: thx [17:49:13] will release later this week [17:49:53] cool [17:50:01] does that solve it for both chrome and FF Krinkle? [17:50:07] Krenair: yes [17:50:11] (I hope) [17:50:13] it should. [17:51:32] !log notebook1003 unresponsive to icinga checks and serial console, rebooting [17:51:36] Krinkle, bd808: do you remember why it was important to allow the client to select a debug backend in the first place? Is it because stickyness is important somehow, or was the idea that people would claim a backend for themselves to avoid clashing with other users? [17:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:02] !log no errors in notebook1003 SEL [17:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:17] ori: The typical workflow is to stage a patch on one of the servers, then you'd want to get responses from that one, not the other 3 that still match plain app servers. [17:52:45] do we know why it expanded beyond one debug server per DC? [17:52:59] PROBLEM - Host notebook1003 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:07] Sometimes for one-off debugging, but the majority use case nowadays has become our daily SWAT deployments, whereby a deployer runs 'scap pull' on an mwdebug server, and then asks the patch author to verify the bug fix via XWD [17:53:24] ori: pretty rigorous deployment process now :) [17:53:31] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [17:53:48] two is one, and one is none. [17:54:02] the law of redundancy [17:54:22] hrmm [17:54:23] [ *** ] (1 of 2) A start job is running for…007.wikimedia.org (46s / 1min 33s) [17:54:33] notebook1003 reboot is going oddly. [17:54:40] RECOVERY - SSH on notebook1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [17:54:45] also applies to other areas of life, e.g. 
toilet paper, coffee, bread :) [17:54:49] RECOVERY - Host notebook1003 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [17:54:55] so who in SRE team handles the notebook servers? [17:54:59] this is an odd start condition [17:55:18] [FAILED] Failed to mount /mnt/nfs/dumps-labstore1006.wikimedia.org. [17:55:18] See 'systemctl status "mnt-nfs-dumps\\x…1006.wikimedia.org.mount"' for details. [17:55:18] [DEPEND] Dependency failed for Remote File Systems. [17:55:23] uhhhhh [17:55:30] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [17:55:39] RECOVERY - DPKG on notebook1003 is OK: All packages OK [17:55:47] notebook is probably analytics? [17:55:49] robh: are those the SWAP servers in analytics? [17:55:50] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [17:56:03] # SWAP (Jupyter Notebook) Servers with Analytics Cluster Access [17:56:04] node /notebook100[34].eqiad.wmnet/ { [17:56:04] role(swap) [17:56:04] } [17:56:09] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:56:10] Krenair: yeah [17:56:14] afaik [17:56:27] So, its failing to mount a network target during boot [17:56:32] but it's also interacting with labs-nfs [17:56:36] yeah [17:56:36] so bstorm_ might know something about it [17:56:44] im making a task to document it now [17:57:23] huh... [17:57:29] That's no fun [17:57:40] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational [17:58:06] 10Operations: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (10RobH) p:05Triage>03Normal [17:58:06] https://phabricator.wikimedia.org/T204857 [17:58:12] not sure what to toss into that in terms of projects [17:58:21] Its probably trying to mount the dumps NFS share [17:58:28] it is and i bet hostname changed? 
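The odd-looking unit name in the boot message above comes from how systemd turns a mount path into a unit name: slashes become dashes, and a literal dash in the path is escaped as `\x2d`. A small sketch of that escaping, assuming the rules documented in systemd.unit(5); the function names are made up for illustration and this is not a replacement for `systemd-escape --path`.

```python
import string

# Characters systemd leaves unescaped in unit names (per systemd.unit(5)).
_ALLOWED = set(string.ascii_letters + string.digits + ":_.")


def systemd_escape_path(path):
    """Escape a filesystem path the way systemd does for unit names."""
    parts = [p for p in path.split("/") if p]  # drop leading/trailing/duplicate slashes
    if not parts:
        return "-"  # the root directory is represented as a single dash
    out = []
    for i, byte in enumerate("/".join(parts).encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif ch in _ALLOWED and not (i == 0 and ch == "."):
            out.append(ch)
        else:
            # Everything else, including "-" itself, becomes a \xXX escape.
            out.append("\\x%02x" % byte)
    return "".join(out)


def mount_unit_name(mount_point):
    """Name of the .mount unit systemd generates for a mount point."""
    return systemd_escape_path(mount_point) + ".mount"
```

Passing the mount point from the failure message to `mount_unit_name` gives the name to use with `systemctl status`, quoted so the shell does not eat the backslash.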
[17:58:29] RECOVERY - Check systemd state on elastic2007 is OK: OK - running: The system is fully operational [17:58:48] hopefully something that easy ;D [17:59:17] I'd just chuck on analytics-cluster and cloud-services and see what they make of it [17:59:40] done ;D [17:59:46] 10Operations, 10Analytics, 10Analytics-Cluster, 10Cloud-Services: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (10RobH) [17:59:49] if anyone asks i did it cuz krenair said to! [17:59:51] (just kidding) [17:59:51] lol [18:00:40] robh: the analytics servers generally mount the web server not the other one [18:00:47] If I recall correctly. Let me check that [18:01:04] node /labstore100[67]\.wikimedia\.org/ { [18:01:04] role(dumps::distribution::server) [18:01:04] Might be some quirk [18:01:10] Yes, those [18:01:21] so the hostname seems fine [18:01:21] One goes to one setup and the other server serves the other side [18:01:24] nothing else has that role [18:02:16] one option would be to treat the backend= value as a (consistent) hash key and use nginx's load balancing feature to map the value to a particular backend. The only issue is, how does the deployer know on which server to stage their change? But you could solve that by having a small script you run to see what debug backend a particular key maps to [18:02:27] Krinkle: ^ [18:04:49] debug_backend() { curl -sv -H "x-wikimedia-debug: backend=$1" 'https://en.wikipedia.org/wiki/Main_Page' -o/dev/null 2>&1 | grep -Poi '(?<=server: ).*' ; } [18:05:09] Hm.. I may've missed a bit of context. Would additional steps not add work instead of reduce? I'm thinking of the simple use case of person A staging a change on a server and person B needing to verify that it works properly. Such a virtual meet of servers seems simpler to arrange explicitly (e.g. "I'm using X can you stage there", or "I've staged it on X, please verify" - the latter is what we usually do). 
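The idea floated above, treating the `backend=` value as a hash key that consistently maps to one debug server, can be sketched locally without querying the live site. This is a minimal illustration using rendezvous (highest-random-weight) hashing, which is not the same algorithm as nginx's consistent hash but has the same property of interest: removing a host only remaps the keys that were on that host. The `BACKENDS` list and the Python `debug_backend` are assumptions for the example; the shell function of the same name in the log asks the production servers instead.

```python
import hashlib

# Assumed pool of debug hosts, using the mwdebug names mentioned earlier in the log.
BACKENDS = ["mwdebug1001", "mwdebug1002", "mwdebug2001", "mwdebug2002"]


def debug_backend(key, backends=BACKENDS):
    """Map a free-form key (e.g. a Gerrit change ID) to one backend.

    Rendezvous hashing: score every (host, key) pair and pick the host
    with the highest score. The result is stable for a given pool.
    """
    if not backends:
        raise ValueError("no debug backends available")

    def score(host):
        return hashlib.sha256(f"{host}|{key}".encode("utf-8")).digest()

    return max(backends, key=score)
```

A deployer would stage on `debug_backend("461427")` and the verifier would enter the same key in the extension. If a different host drops out of the pool, that key still resolves to the same server.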
[18:05:54] I assume that the deployer can easily know the mapping hence stage the change on server X that maps to debug_eqiad_1 on the extension [18:05:59] and ask the user to test it there [18:06:30] the key could even be by convention the change-id, for example [18:06:34] The names in the extension already match the hostnames, there exists a map, but that's only for back-compat. [18:07:06] Krinkle: yes, but the names in the extension create confusion because they are of old hosts [18:07:22] if they were explicitly placeholders probably it wouldn't create confusion? [18:07:25] volans: yes, that's a bug because it's not using https://noc.wikimedia.org/conf/debug.json [18:07:28] Krenair just fixed that [18:07:47] and the extension does cache that list? [18:07:49] so in the extension, person B would fill in backend=$MY_CHANGE_ID [18:07:54] Swap servers should be able to mount labstore1006, digging around. That should be primary for them. Poking the exports [18:08:06] and the deployer would sync the change to $(debug_backend $MY_CHANGE_ID) [18:08:24] that server is specifically in exports by name [18:09:16] robh: it's mounted fine. Not sure what the message was about [18:09:24] ori: sorry, but I'm not seeing how it would make the user experience simpler for both verifier and deployer. seems like a lot of extra steps compared to currently where person A says "Swat window begins, 5 patches from 3 people, I'll be staging on X", then proceeds to stage the first patch on X, and asks the person to verify and so on, re-using the same each time, until done. [18:09:45] bstorm_: odd, i wonder if it's just a boot race condition that resolves post os load [18:09:53] Requiring each participant to enter identifiers and the deployer to change hosts seems like extra work we can avoid. [18:10:00] but it held up the boot process for about 90 seconds [18:10:03] so figured worth mentioning!
[18:10:15] you're probably right [18:10:17] 10Operations, 10Analytics, 10Analytics-Cluster, 10Cloud-Services: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (10Bstorm) [bstorm@notebook1003]:dumps-labstore1006.wikimedia.org $ df . Filesystem 1K-blocks Used Available Use% Mounte... [18:10:34] ori: I want to understand though. I can see how the mapping would scale better, for example. [18:11:14] Definitely, looks like there is a messed up dependency in systemd. [18:11:21] Like the mount should wait for other things [18:11:31] But things stopped to wait for the mount. [18:11:50] well, the nginx load balancer automatically evicts nonhealthy hosts. so if a debug machine died, you wouldn't have to update debug.json. But arguably you don't need to update debug.json anyway, since you can just ask person B to use foo host instead of bar host. [18:12:10] bstorm_, in that case I suggest it's just an #analytics-cluster task? [18:12:45] Krinkle: anyways, like I said, I think you're right [18:12:47] Sounds right. [18:13:29] I'll leave a note in the ticket. Krenair, do you know who I should subscribe to it? [18:13:40] Off the top of your head [18:14:15] ori: In a cluster-orchestrated future, I suppose we won't have the pool of mwdebug, but instead you'd do something like `xctl add pod --label"debug-gerrit-1234"` then `xctl shell debug-gerrit-1234*` and stage, and the person will enter their gerrit change ID in the extension and stuff figures it out. [18:14:18] temporary pods etc. [18:14:23] bstorm_, otto maybe? [18:14:25] I like the idea of having a free-form input field. [18:14:34] 10Operations, 10Analytics, 10Analytics-Cluster, 10Cloud-Services: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (10Bstorm) Seems like there was a dependency during boot where it gets held up waiting for the mount of NFS in systemd. Perhaps the mount task needs to wa... 
[18:15:34] bstorm_, ottomata* [18:15:55] Thanks [18:15:58] 10Operations, 10Analytics, 10Analytics-Cluster, 10Cloud-Services: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (10Bstorm) Guessing some folks who might be interested in this task. [18:16:20] Added with one other randomly picked off the work board. People can always remove themselves :) [18:16:33] I would hope he'd find it by it simply being analytics-cluster [18:17:38] ori: I think thats what GitHub do, and possibly what Google may or may not be doing :) [18:17:38] oh um, might also be elu.key these days [18:18:15] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1): exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) I'm guessing thi... [18:20:12] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1): exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Krenair) "which aren't avai... [18:25:34] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Dzahn) [18:27:29] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Krenair) >>! In T204830#4599010, @Dzahn wrote: > I think a "site request" doesn't exist... 
[18:28:17] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Dzahn) What you describe is a mediawiki configuration change to me. [18:29:20] RECOVERY - Check systemd state on elastic1018 is OK: OK - running: The system is fully operational [18:30:20] RECOVERY - Check systemd state on elastic1017 is OK: OK - running: The system is fully operational [18:31:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:31:44] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Krenair) The problem is they're not all configuration changes. [18:31:59] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:32:23] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Dzahn) In the past "site request" meant "we need somebody with root to SSH somehere and... 
[18:33:03] !log add security_team_bot to acl*security_team in phab [18:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:32] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Krenair) I think that was what the 'ops' and 'shell' keywords were for. [18:34:00] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:34:04] chasemp, what does that bot do? [18:35:10] RECOVERY - Check systemd state on elastic2010 is OK: OK - running: The system is fully operational [18:35:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:36:08] Krenair: atm nothing specific, intended use is to sync certain tasks to an external system for governance and compliance reporting (basically) to keep phab as the source of truth [18:37:11] ok [18:38:22] 10Operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Dzahn) I think it is literally called "site request" because it was requesting somebody... 
[18:41:00] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational [18:43:21] !log starting elasticsearch eqiad cluster restart for new systemd unit [18:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:30] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T1900) [19:04:09] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [19:04:44] o/ [19:16:37] !log hashar@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/Translate/tag/PageTranslationHooks.php: Avoid warnings and errors caused by x-pagetranslation-tag - T204797 (duration: 00m 58s) [19:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:45] T204797: [Regression 1.32.0-wmf.22] ParserOutput::getLanguageLinks returns invalid values (Undefined index from ApiParse and LinksUpdate) - https://phabricator.wikimedia.org/T204797 [19:17:29] (03PS1) 10Hashar: group1 wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461456 [19:17:31] (03CR) 10Hashar: [C: 032] group1 wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461456 (owner: 10Hashar) [19:17:40] rolling the train [19:19:42] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461456 (owner: 10Hashar) [19:20:30] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:20:53] (03PS1) 10Dduvall: Support a literal body for POST requests in `fetch_url` [software/service-checker] - 10https://gerrit.wikimedia.org/r/461457 [19:24:50] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.22 [19:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:41] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461456 (owner: 10Hashar) [19:25:46] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.22 (duration: 00m 55s) [19:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:39] RECOVERY - Check systemd state on elastic2017 is OK: OK - running: The system is fully operational [19:27:47] oh my god [19:28:50] RECOVERY - Check systemd state on elastic2001 is OK: OK - running: The system is fully operational [19:30:06] !log while promoting group1 to 1.32.0-wmf.22 lot of web requests timed out at 60 seconds. Roughly from 19:24 to 19:28. 
But that is no more occurring | T191068 [19:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:14] T191068: 1.32.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T191068 [19:30:29] 10Operations, 10SRE-Access-Requests: nathante shell request for access to researchers, statistics-privatedata,users, analytics-privatedata-users for groceryheist - https://phabricator.wikimedia.org/T204790 (10RobH) p:05Triage>03Normal [19:31:05] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata,users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) [19:34:23] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata,users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) [19:34:49] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) [19:37:27] (03PS2) 10Ayounsi: Allow mgmt hosts to send syslog to central syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/461170 [19:39:16] (03CR) 10Ayounsi: [C: 032] Allow mgmt hosts to send syslog to central syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/461170 (owner: 10Ayounsi) [19:40:59] !log web request 60 second timeout when deploying is filled as https://phabricator.wikimedia.org/T204871 [19:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:10] RECOVERY - Check systemd state on elastic1043 is OK: OK - running: The system is fully operational [19:41:35] (03PS1) 10Ayounsi: Revert "Allow mgmt hosts to send syslog to central syslog servers" [puppet] - 10https://gerrit.wikimedia.org/r/461460 [19:41:48] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, 
statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) a:03Groceryheist Ok, trying to figure out who did what via task history is a bit confusing, so I'll just be... [19:42:22] (03CR) 10Ayounsi: [C: 032] Revert "Allow mgmt hosts to send syslog to central syslog servers" [puppet] - 10https://gerrit.wikimedia.org/r/461460 (owner: 10Ayounsi) [19:48:54] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10herron) >>! In T203169#4597471, @fgiunchedi wrote: >>>! In T203169#4595824, @herron wrote: >> Assuming 64GB as the currently daily index size (actual daily indice... [19:50:24] (03PS1) 10Ayounsi: Allow mgmt hosts to send syslog to central syslog servers, try 2 [puppet] - 10https://gerrit.wikimedia.org/r/461461 [19:54:24] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/12505/lithium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/461461 (owner: 10Ayounsi) [19:58:46] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) @Groceryheist: That seems to be what is needed, but the main thing I need is confirmation from WMF legal that... [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T2000). [20:00:21] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) I forgot to note, thank you for the expiry date info, so the only thing we need is WMF legal confirmation and... 
[20:00:22] I have a deploy for ORES [20:01:02] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) a:05Groceryheist>03RStallman-legalteam [20:01:39] no parsoid deploy [20:01:40] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10Groceryheist) @RobH: Great. Thanks! [20:03:39] Going to deploy MCS in a bit (~5 minutes or so) [20:03:39] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The system is fully operational [20:05:14] !log ladsgroup@deploy1001 Started deploy [ores/deploy@76fe25a]: ORES doesn't block hammering IPs (T204862) [20:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:22] T204862: ORES doesn't block hammering IPs - https://phabricator.wikimedia.org/T204862 [20:05:36] I started deploying ores [20:06:55] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [20:07:42] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@55d0b34]: Update mobileapps to a224e99 Ia80abe02490 [20:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:01] (03CR) 10Herron: [C: 031] "A quick test of the rules output by the compiler shows ferm restarting happily on a test host. lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/461461 (owner: 10Ayounsi) [20:08:21] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [20:09:07] works on canary, moving forward [20:09:18] (03CR) 10Herron: [C: 031] "> Created I1790c8cebe079c40d2851874cd24a4b22b4dbc28 to simplify this and" [dns] - 10https://gerrit.wikimedia.org/r/143762 (owner: 10Faidon Liambotis) [20:09:55] (03CR) 10Ayounsi: [C: 032] Allow mgmt hosts to send syslog to central syslog servers, try 2 [puppet] - 10https://gerrit.wikimedia.org/r/461461 (owner: 10Ayounsi) [20:11:12] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@55d0b34]: Update mobileapps to a224e99 Ia80abe02490 (duration: 03m 29s) [20:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:54] (03PS1) 10Ayounsi: Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/461466 [20:21:18] (03CR) 10Ayounsi: [C: 032] Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/461466 (owner: 10Ayounsi) [20:22:27] (03PS2) 10Ayounsi: Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/461466 [20:24:43] !log repool ulsfo [20:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:27] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@76fe25a]: ORES doesn't block hammering IPs (T204862) (duration: 21m 13s) [20:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:35] T204862: ORES doesn't block hammering IPs - https://phabricator.wikimedia.org/T204862 [20:53:23] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10MarcoAurelio) [20:54:39] jouncebot: next [20:54:39] In 2 hour(s) and 5 minute(s): Evening SWAT (Max 6 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T2300) [20:54:50] Hmm [20:55:30] greg-g: OK to do an emergency deploy for the UBN fix to T204873 (new train blocker)? [20:55:31] T204873: "Special page" accidentally renamed to "CollabPad" - https://phabricator.wikimedia.org/T204873 [20:55:42] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10MarcoAurelio) [21:00:30] James_F: yes [21:00:52] Thank you. I have the conch. [21:02:27] (03PS2) 10Volans: Documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/458358 [21:10:12] (03CR) 10Volans: [C: 032] Documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/458358 (owner: 10Volans) [21:13:18] (03Merged) 10jenkins-bot: Documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/458358 (owner: 10Volans) [21:14:45] (03CR) 10jenkins-bot: Documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/458358 (owner: 10Volans) [21:24:39] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/VisualEditor/includes/SpecialCollabPad.php: T204873 Fix special page override hot deploy (duration: 00m 57s) [21:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:47] T204873: "Special page" tab accidentally renamed to "CollabPad" - https://phabricator.wikimedia.org/T204873 [21:27:17] OK, all looks good. I'm releasing the conch. 
[21:32:10] 10Operations, 10Cleanup, 10GitHub-Mirrors, 10OCG-General, and 6 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10MarcoAurelio) [21:33:14] 10Operations, 10Mail, 10Patch-For-Review, 10User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (10herron) Reworked and added a few more panels to the bottom of https://grafana.wikimedia.org/dashboard/db/mail to better show the distribution of ciphers and TLS versions. [21:42:04] 10Operations, 10Cleanup, 10GitHub-Mirrors, 10OCG-General, and 6 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10MarcoAurelio) [21:47:32] (03PS8) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) [21:48:20] (03PS3) 10C. Scott Ananian: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 [21:51:32] (03CR) 10Ayounsi: "Previous Puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [21:57:57] (03CR) 10Ayounsi: [C: 032] Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [22:00:36] !log merging icinga check_bfd [22:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:00] RECOVERY - Check systemd state on elastic2006 is OK: OK - running: The system is fully operational [22:05:14] (03CR) 10Ayounsi: "sudo icinga -v /etc/icinga/icinga.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [22:15:50] (03CR) 10Krinkle: [C: 031] "Indeed depends on the core change, because current core still uses an outdated check for 
HHVM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [22:23:51] (03PS1) 10Ayounsi: SNMP: add additional mibdir [puppet] - 10https://gerrit.wikimedia.org/r/461480 [22:29:02] (03CR) 10Ayounsi: "Manually tested and working fine." [puppet] - 10https://gerrit.wikimedia.org/r/461480 (owner: 10Ayounsi) [22:29:23] (03CR) 10Ayounsi: [C: 032] SNMP: add additional mibdir [puppet] - 10https://gerrit.wikimedia.org/r/461480 (owner: 10Ayounsi) [22:30:47] (03CR) 10Dzahn: [V: 031 C: 031] "confirmed:" [puppet] - 10https://gerrit.wikimedia.org/r/461480 (owner: 10Ayounsi) [22:47:29] RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational [22:59:58] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) mwmaint1002.eqiad.wmnet has address 10.64.16.77 mwmaint1002.eqiad.wmnet has IPv6 address 2620:0:861:102:10:64:16:77 [mwmaint1001:~] $ host 2620:0:861:102:10:64:16:7... [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180919T2300). [23:00:05] Jdlrobson and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:03] I'll SWAT today [23:01:40] jdlrobson: You here for your SWAT? 
[23:03:22] (03PS1) 10Dzahn: tcpircbot: add mwmaint1002 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/461486 (https://phabricator.wikimedia.org/T201343) [23:03:24] (03PS2) 10Dzahn: tcpircbot: add mwmaint1002 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/461486 (https://phabricator.wikimedia.org/T201343) [23:04:01] (03CR) 10Dzahn: [C: 032] tcpircbot: add mwmaint1002 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/461486 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:05:29] \o [23:05:37] RoanKattouw: sorry got held up [23:05:55] I am here now and itching to deploy! [23:06:51] Alright, let's roll [23:07:11] \o/ [23:07:25] (03CR) 10Catrope: [C: 032] Enable Page issuses A/B test for Latvian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461420 (https://phabricator.wikimedia.org/T204609) (owner: 10Pmiazga) [23:07:47] (03PS1) 10Dzahn: site/admin: replace mwmaint1001 with 1002 in comments [puppet] - 10https://gerrit.wikimedia.org/r/461487 (https://phabricator.wikimedia.org/T201343) [23:08:40] (03Merged) 10jenkins-bot: Enable Page issuses A/B test for Latvian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461420 (https://phabricator.wikimedia.org/T204609) (owner: 10Pmiazga) [23:10:32] jdlrobson: PageIssues A/B test for Latvian is on mwdebug2001, please test [23:10:35] on it! 
[23:10:49] (03PS1) 10Dzahn: network::constants:: add mwmaint1002 to maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/461488 (https://phabricator.wikimedia.org/T201343) [23:10:51] (03PS8) 10Catrope: Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:11:11] (03CR) 10Catrope: [C: 032] Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:11:17] (03PS5) 10Catrope: Remove obsolete $wgPopupsBetaFeature, Part II: InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452863 (https://phabricator.wikimedia.org/T203589) (owner: 10Jforrester) [23:11:22] (03CR) 10Catrope: [C: 032] Remove obsolete $wgPopupsBetaFeature, Part II: InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452863 (https://phabricator.wikimedia.org/T203589) (owner: 10Jforrester) [23:11:32] (03PS2) 10Dzahn: network::constants:: add mwmaint1002 to maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/461488 (https://phabricator.wikimedia.org/T201343) [23:11:35] (03PS10) 10Catrope: Remove obsolete $wgPopupsBetaFeature, Part III: InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:11:37] (03CR) 10Catrope: [C: 032] Remove obsolete $wgPopupsBetaFeature, Part III: InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:11:43] (03CR) 10Dzahn: [C: 032] network::constants:: add mwmaint1002 to maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/461488 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:11:55] @RoanKattouw wait you mean mwdebug1001? 
[23:12:13] Sorry, it'll say mw2017 in your dropdown [23:12:23] jdlrobson: no, it's actually 2001 currently we are in codfw [23:12:29] (03Merged) 10jenkins-bot: Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:12:31] eh, something starting with 2 anyays [23:12:33] I forgot about that; it's called mwdebug2001 on my end but mw2017 on your end [23:12:33] (03Merged) 10jenkins-bot: Remove obsolete $wgPopupsBetaFeature, Part II: InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452863 (https://phabricator.wikimedia.org/T203589) (owner: 10Jforrester) [23:12:37] (03Merged) 10jenkins-bot: Remove obsolete $wgPopupsBetaFeature, Part III: InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:12:46] ahh mw2017.codfw.wmnet [23:13:27] (03CR) 10jenkins-bot: Enable Page issuses A/B test for Latvian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461420 (https://phabricator.wikimedia.org/T204609) (owner: 10Pmiazga) [23:13:30] (03CR) 10jenkins-bot: Remove obsolete $wgPopupsBetaFeature, Part II: InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452863 (https://phabricator.wikimedia.org/T203589) (owner: 10Jforrester) [23:13:32] (03CR) 10jenkins-bot: Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:13:34] (03CR) 10jenkins-bot: Remove obsolete $wgPopupsBetaFeature, Part III: InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 (https://phabricator.wikimedia.org/T203589) (owner: 10Prtksxna) [23:13:36] RoanKattouw: you can sync now! Looks good to me! [23:13:42] Yeah, 2017 for 2001 [23:13:52] I ran into that earlier today. 
:-) [23:14:01] but please hang around a bit, as i need to verify the sampling rates don't burn things [23:14:09] and i'll need about 10 minutes to say that with confidence :) [23:14:12] October 11th this will be back to 1xxx [23:14:24] mutante: Hopefully. :-) [23:14:28] heh, yea [23:14:57] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Page issues A/B test on lvwiki (T204609) (duration: 00m 58s) [23:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:04] T204609: Turn on page issues A/B test for Latvian wikipedia - https://phabricator.wikimedia.org/T204609 [23:16:09] jdlrobson: Popups beta feature patches (all 3) now on mw2017, please test [23:16:17] mwdebug2001.codfw.wmnet is the correct name in DNS. mw2017 doesn't exist there [23:16:45] (since it was repurposed as another role) [23:17:35] RoanKattouw: on it [23:17:48] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/PageTriage/: PageTriage fixes for T203184 (duration: 00m 58s) [23:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:56] T203184: Deploy PageTriage AfC to production - https://phabricator.wikimedia.org/T203184 [23:18:32] RoanKattouw: that also looks good - Popups still working [23:19:36] (03PS1) 10Dzahn: scap/dsh: add mwmaint1002 to mw scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/461489 (https://phabricator.wikimedia.org/T201343) [23:21:01] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove obsolete $wgPopupsBetaFeature, part 1 (T203589) (duration: 00m 56s) [23:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:07] T203589: SWAT $wgPopupsBetaFeature reading web config cleanup - https://phabricator.wikimedia.org/T203589 [23:21:24] (03PS1) 10Dzahn: trafficserver: replace mwmaint1001 with 1002 as noc.wm.org backend [puppet] - 10https://gerrit.wikimedia.org/r/461490 (https://phabricator.wikimedia.org/T201343) 
[23:21:58] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove obsolete $wgPopupsBetaFeature, part 2 (T203589) (duration: 00m 56s) [23:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:10] OK, all done [23:23:39] (03PS1) 10Dzahn: site: add mediawiki_maintenance role to mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/461491 (https://phabricator.wikimedia.org/T201343) [23:23:42] (03PS1) 10Dzahn: site: turn mwmaint1001 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) [23:24:48] thanks RoanKattouw . Will keep an eye on event logging and let you know if anything goes wrong [23:28:13] (03PS1) 10Dzahn: mariadb: add mwmaint1002 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/461493 (https://phabricator.wikimedia.org/T201343) [23:31:19] (03PS2) 10Dzahn: mariadb: add mwmaint1002 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/461493 (https://phabricator.wikimedia.org/T201343) [23:32:25] (03CR) 10Gehel: [C: 031] "LGTM, let's wait for volans last comment. I'm sure he'll find something!" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [23:32:51] (03CR) 10Dzahn: "This is about backend config for ATS. 
I am not sure if it can happen anytime or needs to wait until we are actively switched from mwmaint1" [puppet] - 10https://gerrit.wikimedia.org/r/461490 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:33:21] (03CR) 10Gehel: [C: 04-1] "See comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461206 (https://phabricator.wikimedia.org/T204106) (owner: 10Mathew.onipe) [23:33:25] (03CR) 10Dzahn: "for more details what this is about see commit message on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461493/" [puppet] - 10https://gerrit.wikimedia.org/r/461490 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:34:30] (03CR) 10Dzahn: [C: 04-1] "don't add to scap until it has mediawiki installed" [puppet] - 10https://gerrit.wikimedia.org/r/461489 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:35:40] thanks RoanKattouw looks like we're good. Not seeing any big spikes on the old EventLogging [23:36:44] (03PS1) 10Dzahn: mediawiki_maintenance: allow rsyncing home dirs from 1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/461494 (https://phabricator.wikimedia.org/T201343) [23:37:21] (03CR) 10jerkins-bot: [V: 04-1] mediawiki_maintenance: allow rsyncing home dirs from 1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/461494 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:37:23] RoanKattouw: Did you sync the part III? [23:38:14] RoanKattouw: Oh, right, you synced part 3 (InitialiseSettings.php) as part 2, and didn't sync part 2 (InitialiseSettings-labs.php). [23:38:20] (03PS2) 10Dzahn: mediawiki_maintenance: allow rsyncing home dirs from 1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/461494 (https://phabricator.wikimedia.org/T201343) [23:39:40] thanks jerkins-bot. 
it detected "mwmaint1002.codfw.wmnet" because of the combo of "codfw" and number starting with 1 [23:39:52] that's one of the better typo patterns we have [23:41:01] (03CR) 10Dzahn: [C: 032] mediawiki_maintenance: allow rsyncing home dirs from 1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/461494 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:45:48] (03PS1) 10Dzahn: Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - 10https://gerrit.wikimedia.org/r/461496 [23:46:10] (03CR) 10jerkins-bot: [V: 04-1] Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - 10https://gerrit.wikimedia.org/r/461496 (owner: 10Dzahn) [23:47:11] (03CR) 10Dzahn: "i would like to revert this change temporarily so that i can finish T201343 before we are back in eqiad as the active dc.. and after that " [puppet] - 10https://gerrit.wikimedia.org/r/457492 (owner: 10Giuseppe Lavagetto) [23:47:28] (03CR) 10Dzahn: "i would like to revert this temporarily so that i can finish T201343 before we are back in eqiad as the active dc.. 
and after that is done" [puppet] - 10https://gerrit.wikimedia.org/r/461496 (owner: 10Dzahn) [23:52:20] (03CR) 10Dzahn: [C: 032] "this won't work yet because mwmaint1002 doesn't have the mediawiki_maintenance role yet but it can't be applied because since I6c67e6214a6" [puppet] - 10https://gerrit.wikimedia.org/r/461494 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:54:09] (03CR) 10Dzahn: [C: 031] dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [23:55:12] (03CR) 10Dzahn: [C: 04-1] "can't be applied because since I6c67e6214a6a1b010cdf4 that would mean it automatically gets active cron jobs and can't be activated separa" [puppet] - 10https://gerrit.wikimedia.org/r/461491 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:55:16] (03CR) 10Dzahn: [C: 04-1] "not until mwmaint1002 is active" [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:55:41] (03CR) 10Dzahn: [C: 031] "can go anytime since it adds both systems in paralell rather than replacing in a single step" [puppet] - 10https://gerrit.wikimedia.org/r/461493 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [23:56:20] (03CR) 10Dzahn: [C: 04-1] "just comments, but shouldn't claim this until mwmaint1002 actually has the role.which is stalled" [puppet] - 10https://gerrit.wikimedia.org/r/461487 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn)