[00:11:45] Operations, Discovery-Search, Wikidata, Wikidata-Query-Service: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Bstorm) Ok, it took a bit of doing, but here's what I found: The limits that are in place are based on having a server that was on 10G Etherne...
[00:14:45] (PS1) Bstorm: dumps distribution: increase the rate limit to 5MBps [puppet] - https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349)
[00:15:43] Operations, Data-Services, Discovery-Search, Wikidata, and 3 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Bstorm)
[00:34:54] (PS1) Bstorm: toolforge-kubernetes: disable profiling on api servers [puppet] - https://gerrit.wikimedia.org/r/555634 (https://phabricator.wikimedia.org/T240009)
[00:58:51] !log andrew@deploy1001 Finished deploy [horizon/deploy@1911591]: (no justification provided) (duration: 106m 13s)
[00:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:51] !log andrew@deploy1001 Started deploy [horizon/deploy@0f70602]: (no justification provided)
[00:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:55] !log andrew@deploy1001 Finished deploy [horizon/deploy@0f70602]: (no justification provided) (duration: 02m 04s)
[01:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:11] !log andrew@deploy1001 Started deploy [horizon/deploy@0f70602]: (no justification provided)
[01:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:06] !log andrew@deploy1001 Finished deploy [horizon/deploy@0f70602]: (no justification provided) (duration: 02m 55s)
[01:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:22] (CR) Ebe123: [C: +1] Add new HD logos to wgLogoHD [mediawiki-config] - https://gerrit.wikimedia.org/r/555629 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[01:27:32] (CR) Ebe123: [C: +1] Update three logos with more detailed versions [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[01:30:55] (PS1) Catrope: [beta] Align GrowthExperiments treatment groups in beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/555639 (https://phabricator.wikimedia.org/T232396)
[01:32:02] (PS2) Catrope: GrowthExperiments: Align help panel new account enabling with homepage [mediawiki-config] - https://gerrit.wikimedia.org/r/553222 (https://phabricator.wikimedia.org/T232396)
[01:37:45] (CR) Catrope: [C: -1] "Needs to wait until wmf.8 is deployed everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/553222 (https://phabricator.wikimedia.org/T232396) (owner: Catrope)
[01:37:52] (CR) Catrope: [C: +2] [beta] Align GrowthExperiments treatment groups in beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/555639 (https://phabricator.wikimedia.org/T232396) (owner: Catrope)
[01:38:43] (Merged) jenkins-bot: [beta] Align GrowthExperiments treatment groups in beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/555639 (https://phabricator.wikimedia.org/T232396) (owner: Catrope)
[01:39:27] (PS1) Catrope: [beta] Enable GrowthExperiments 40/40/20 homepage test in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/555640 (https://phabricator.wikimedia.org/T238888)
[01:39:40] (PS3) Catrope: GrowthExperiments: Begin "initiation test" for suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/553225 (https://phabricator.wikimedia.org/T238888)
[01:40:26] (CR) Catrope: [C: -1] "Not until wmf.8 is deployed everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/553225 (https://phabricator.wikimedia.org/T238888) (owner: Catrope)
[01:40:36] (CR) Catrope: [C: +2] [beta] Enable GrowthExperiments 40/40/20 homepage test in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/555640 (https://phabricator.wikimedia.org/T238888) (owner: Catrope)
[01:42:19] (PS2) Catrope: [beta] Enable GrowthExperiments 40/40/20 homepage test in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/555640 (https://phabricator.wikimedia.org/T238888)
[01:42:33] (CR) Catrope: [C: +2] [beta] Enable GrowthExperiments 40/40/20 homepage test in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/555640 (https://phabricator.wikimedia.org/T238888) (owner: Catrope)
[01:43:25] (Merged) jenkins-bot: [beta] Enable GrowthExperiments 40/40/20 homepage test in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/555640 (https://phabricator.wikimedia.org/T238888) (owner: Catrope)
[02:55:40] !log andrew@deploy1001 Started deploy [horizon/deploy@0f70602]: (no justification provided)
[02:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:47] !log andrew@deploy1001 Finished deploy [horizon/deploy@0f70602]: (no justification provided) (duration: 00m 07s)
[02:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:11] !log andrew@deploy1001 Started deploy [horizon/deploy@0f70602]: (no justification provided)
[02:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:51] !log andrew@deploy1001 Finished deploy [horizon/deploy@0f70602]: (no justification provided) (duration: 01m 40s)
[02:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:34] !log andrew@deploy1001 Started deploy [horizon/deploy@841693b]: (no justification provided)
[03:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:19] !log andrew@deploy1001 Finished deploy [horizon/deploy@841693b]: (no justification provided) (duration: 01m 45s)
[03:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:36:47] (PS1) Andrew Bogott: Horizon: set SECRET_KEY to the same value across a deployment [puppet] - https://gerrit.wikimedia.org/r/555646 (https://phabricator.wikimedia.org/T145703)
[03:38:33] (CR) jenkins-bot: [V: -1] Horizon: set SECRET_KEY to the same value across a deployment [puppet] - https://gerrit.wikimedia.org/r/555646 (https://phabricator.wikimedia.org/T145703) (owner: Andrew Bogott)
[03:40:44] Operations, DC-Ops, hardware-requests, Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (hashar)
[03:41:18] (PS1) Andrew Bogott: Add dummy horizon secret keys [labs/private] - https://gerrit.wikimedia.org/r/555647
[03:41:44] Operations, DC-Ops, hardware-requests, Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (hashar)
[03:44:11] Operations, DC-Ops, hardware-requests, Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (hashar)
[03:44:54] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (hashar)
[03:44:56] Operations, DC-Ops, hardware-requests, Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (hashar)
[03:44:59] (PS2) Andrew Bogott: Add dummy horizon secret keys [labs/private] - https://gerrit.wikimedia.org/r/555647
[03:48:06] (CR) Andrew Bogott: [V: +2 C: +2] Add dummy horizon secret keys [labs/private] - https://gerrit.wikimedia.org/r/555647 (owner: Andrew Bogott)
[03:50:01] (PS2) Andrew Bogott: Horizon: set SECRET_KEY to the same value across a deployment [puppet] - https://gerrit.wikimedia.org/r/555646 (https://phabricator.wikimedia.org/T145703)
[03:54:13] (PS1) Andrew Bogott: Fix dummy key for eqiad1 horizon [labs/private] - https://gerrit.wikimedia.org/r/555648
[03:54:27] (CR) Andrew Bogott: [V: +2 C: +2] Fix dummy key for eqiad1 horizon [labs/private] - https://gerrit.wikimedia.org/r/555648 (owner: Andrew Bogott)
[03:59:25] (PS3) Andrew Bogott: Horizon: set SECRET_KEY to the same value across a deployment [puppet] - https://gerrit.wikimedia.org/r/555646 (https://phabricator.wikimedia.org/T145703)
[04:01:34] (CR) Andrew Bogott: [C: +2] Horizon: set SECRET_KEY to the same value across a deployment [puppet] - https://gerrit.wikimedia.org/r/555646 (https://phabricator.wikimedia.org/T145703) (owner: Andrew Bogott)
[04:08:29] !log andrew@deploy1001 Started deploy [horizon/deploy@841693b]: (no justification provided)
[04:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:23] !log andrew@deploy1001 Finished deploy [horizon/deploy@841693b]: (no justification provided) (duration: 01m 55s)
[04:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:59] (PS2) Ammarpad: Enable local uploads on inh.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925)
[06:30:04] (PS1) TechneSiyam: Added 1.5x and 2x pngs of wiki project logos that are in SVG. [mediawiki-config] - https://gerrit.wikimedia.org/r/555649
[09:01:39] PROBLEM - PHP7 rendering on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:03:23] RECOVERY - PHP7 rendering on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 4.554 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:51:44] !log reboot dumpsdata1002, checking that rpc.statd starts on boot properly
[09:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:47] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[11:57:35] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5600 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:02:57] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5093 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:20:51] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5146 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:24:51] PROBLEM - PHP7 jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[12:26:31] RECOVERY - PHP7 jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 3.044 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:27:59] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5097 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:36:41] PROBLEM - PHP7 rendering on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:37:11] a bit weird, mcrouter looks ok
[12:37:44] seems related to a couple of jobrunners
[12:37:55] ah yes mw1299 is one
[12:38:25] RECOVERY - PHP7 rendering on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 5.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:38:56] weird, tkos registered to mw2
[12:39:01] the proxies..
[12:40:31] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5138 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:44:05] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5021 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:44:37] PROBLEM - PHP7 jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[12:46:13] RECOVERY - PHP7 jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:52:47] lunch and I am going to triple check, I am running ifstat on mw2271 to see if there are any bandwidth issues but it doesn't seem so
[12:53:50] there seems to be an error while connecting from mw1299's mcrouter to mw2271's mcrouter (to replicate keys)
[13:07:21] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5822 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:16:17] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5390 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:23:26] in theory even if we see tkos for mw2 proxies, we have async routes so it should not lead to timeouts
[13:23:53] but I am seeing client handshakeErr: AsyncSocketException: SSL connect timed out after 268ms, type = Timed out
[13:24:28] so I am wondering if in this case the async route to codfw doesn't count
[13:24:58] (so mcrouter in eqiad is forced to wait for the connect to timeout, leading to other timeouts)
[13:25:13] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5644 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:26:09] PROBLEM - Nginx local proxy to videoscaler on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[13:26:09] PROBLEM - Nginx local proxy to jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[13:26:17] PROBLEM - PHP7 rendering on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:26:45] !log restart php-fpm on mw1299 (jobrunner) as test
[13:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:47] RECOVERY - Nginx local proxy to videoscaler on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[13:27:47] RECOVERY - Nginx local proxy to jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[13:27:55] RECOVERY - PHP7 rendering on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:29:33] !log restart php-fpm on mw1293 (jobrunner) as test
[13:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:35] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 768 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:32:03] the two jobrunners were overloaded, a simple restart of php-fpm brought down all the errors
[13:35:37] PROBLEM - PHP opcache health on mw1293 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:37:18] I guess that this is the side effect of the restart --^
[13:37:23] RECOVERY - PHP opcache health on mw1293 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:37:31] yes :)
[13:38:37] let's see if this was a temporary issue or if it will re-occur
[13:38:39] * elukey afk!
[13:44:39] !log andrew@deploy1001 Started deploy [horizon/deploy@841693b]: (no justification provided)
[13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:46] !log andrew@deploy1001 Finished deploy [horizon/deploy@841693b]: (no justification provided) (duration: 00m 08s)
[13:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:15] (Abandoned) Urbanecm: This is for GCI - TechneSiyam [mediawiki-config] - https://gerrit.wikimedia.org/r/554889 (owner: TechneSiyam)
[15:31:54] (CR) Urbanecm: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[15:32:01] (CR) Urbanecm: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[15:32:05] (CR) Urbanecm: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/555629 (https://phabricator.wikimedia.org/T150618) (owner: Bjornskjald)
[21:02:05] (PS1) IAmNetx: Add aliases for Help and Project on eswikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/555692 (https://phabricator.wikimedia.org/T240050)
[21:55:43] (CR) Ammarpad: [C: +1] "Looks Ok." [mediawiki-config] - https://gerrit.wikimedia.org/r/555692 (https://phabricator.wikimedia.org/T240050) (owner: IAmNetx)
[22:40:41] (CR) Jforrester: [C: +1] "LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/555621 (https://phabricator.wikimedia.org/T239925) (owner: Ammarpad)