[00:01:55] !log seeing "php: Lost parent, LightProcess exiting" in syslog on mw1275 today (T124956)
[00:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:06] T124956: Rise in "parent, LightProcess exiting" console spam - https://phabricator.wikimedia.org/T124956
[00:03:26] (Abandoned) Dzahn: lists: raise TTL back to 1H after service IP change [dns] - https://gerrit.wikimedia.org/r/354072 (owner: Dzahn)
[00:05:54] 27 hhvm.server.light_process_count = 5
[00:06:02] ^ yes, it has it set to higher than zero
[00:06:26] " When HHVM is configured to have a nonzero number of LightProcess workers (pre-forked subprocesses it creates on startup to make shelling out cheaper), each worker prints out this message when it exits, even on normal termination." -- Ori
[00:07:40] but others, like mw1276 do too
[00:08:08] The Hiera override for this was apparently applied on (just) deployment servers
[00:08:47] RainbowSprinkles: does it still look any different than say mw1276, wherever you saw it pop up
[00:10:27] It's still spewing them to logstash. Coincided exactly with wmf.4 deploy to wikipedias, but not affecting other nodes :\
[00:11:05] so that light_process_worker setting is the same on another box .. uhm...
[00:11:15] otherwise it would have all matched that ticket
[00:12:56] mutante: I'm inclined to just depool 1275 for now
[00:14:04] Operations, HHVM, MW-1.27-release (WMF-deploy-2016-01-19_(1.27.0-wmf.11)), Patch-For-Review: Rise in "parent, LightProcess exiting" console spam - https://phabricator.wikimedia.org/T124956#1970973 (Dzahn) mw1275 has them all over syslog, and the `light_process_count` is indeed set to 5, so non-ze...
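To compare how noisy mw1275 is against a peer such as mw1276, one could tally the "Lost parent, LightProcess exiting" message per host. A minimal Python sketch, assuming classic syslog lines of the form `Mon DD HH:MM:SS host program: message` (the line layout and the sample lines below are illustrative assumptions, not taken from the hosts):

```python
import re
from collections import Counter

# Assumed classic syslog layout, e.g. "Jun  9 00:01:55 mw1275 hhvm: php: Lost parent, ..."
SYSLOG_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+[^:]+:\s*(?P<msg>.*)$"
)
NEEDLE = "Lost parent, LightProcess exiting"


def count_lightprocess_exits(lines):
    """Return a Counter mapping hostname -> number of LightProcess exit messages."""
    counts = Counter()
    for line in lines:
        m = SYSLOG_RE.match(line)
        if m and NEEDLE in m.group("msg"):
            counts[m.group("host")] += 1
    return counts


sample = [
    "Jun  9 00:01:55 mw1275 hhvm: php: Lost parent, LightProcess exiting",
    "Jun  9 00:01:56 mw1275 hhvm: php: Lost parent, LightProcess exiting",
    "Jun  9 00:01:57 mw1276 hhvm: php: Lost parent, LightProcess exiting",
    "Jun  9 00:01:58 mw1276 CRON[123]: (root) CMD (true)",
]
print(count_lightprocess_exits(sample).most_common())
```

Since the quoted explanation says every LightProcess worker prints this on exit, a raw count alone does not prove anything is wrong; it is mainly useful for spotting a host whose rate differs sharply from its peers.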
[00:14:12] ok, let's do that
[00:14:22] i'll do the easiest way, already on it
[00:15:23] !log mw1275 depooled (T124956)
[00:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:33] T124956: Rise in "parent, LightProcess exiting" console spam - https://phabricator.wikimedia.org/T124956
[00:15:38] by that i meant literally typing "depool" on the box
[00:15:54] i just expected a log line, kind of
[00:17:08] hehe :)
[00:18:17] ] $ sudo -i
[00:18:18] root@mw1275:~# depool
[00:18:18] Depooling mw1275.eqiad.wmnet from all services...
[00:18:52] that's the alternative way to targeting it from conftool configmaster
[00:30:30] PROBLEM - HHVM rendering on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:31:20] RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 81248 bytes in 0.131 second response time
[00:43:41] (PS1) Dzahn: system::role: remove leading 'role::' to avoid role-role [puppet] - https://gerrit.wikimedia.org/r/357960
[00:50:22] (PS3) Dzahn: fix all the "role-role" in system::roles [puppet] - https://gerrit.wikimedia.org/r/354172
[00:54:06] (CR) Dzahn: "ok:) here is the follow-up with a regsubst for that https://gerrit.wikimedia.org/r/#/c/357960/" [puppet] - https://gerrit.wikimedia.org/r/354172 (owner: Dzahn)
[02:19:02] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 06m 04s)
[02:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:29] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jun 9 02:25:29 UTC 2017 (duration 6m 27s)
[02:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:30:30] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 713.34 seconds
[03:33:30] PROBLEM - HHVM rendering on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:20] RECOVERY - HHVM rendering on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 81828 bytes in 0.130 second response time
[03:47:30] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 249.48 seconds
[03:57:20] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:58:10] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 211 bytes in 0.351 second response time
[03:58:50] PROBLEM - Check whether ferm is active by checking the default input chain on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:59:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw1294 is OK: OK ferm input default policy is set
[04:06:14] (PS2) Zhuyifei1999: tools-static: add /fontcdn/ to reverse-proxy to Google Fonts [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027)
[04:10:30] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=538.60 Read Requests/Sec=578.10 Write Requests/Sec=2.20 KBytes Read/Sec=44496.40 KBytes_Written/Sec=153.60
[04:19:30] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=114.40 Read Requests/Sec=196.50 Write Requests/Sec=0.80 KBytes Read/Sec=2853.20 KBytes_Written/Sec=222.40
[04:30:20] PROBLEM - MD RAID on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:30:20] PROBLEM - Check systemd state on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:30:21] PROBLEM - puppet last run on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:31:20] RECOVERY - Check systemd state on mw1294 is OK: OK - running: The system is fully operational
[04:31:20] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures
[04:31:20] RECOVERY - MD RAID on mw1294 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[05:33:38] (PS1) Gilles: Disable Thumbor dual-serving and serve testwiki with Thumbor [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490)
[05:34:23] (CR) jerkins-bot: [V: -1] Disable Thumbor dual-serving and serve testwiki with Thumbor [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490) (owner: Gilles)
[05:40:52] (PS2) Gilles: Disable Thumbor dual-serving and serve testwiki with Thumbor [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490)
[05:41:32] (CR) jerkins-bot: [V: -1] Disable Thumbor dual-serving and serve testwiki with Thumbor [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490) (owner: Gilles)
[05:42:27] (PS3) Gilles: Disable Thumbor dual-serving and serve testwiki with Thumbor [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490)
[05:43:08] (PS1) Ayounsi: Rancid: set configs world readable [puppet] - https://gerrit.wikimedia.org/r/357969 (https://phabricator.wikimedia.org/T167288)
[05:45:13] (CR) Ayounsi: [C: 2] Rancid: set configs world readable [puppet] - https://gerrit.wikimedia.org/r/357969 (https://phabricator.wikimedia.org/T167288) (owner: Ayounsi)
[05:47:32] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/357971
[05:47:37] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/357971
[05:49:09] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/357971 (owner: Marostegui)
[05:50:07] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/357971 (owner: Marostegui)
[05:50:16] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/357971 (owner: Marostegui)
[05:51:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 - T166205 (duration: 00m 42s)
[05:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:22] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[05:53:26] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/357972
[05:53:30] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/357972
[05:55:14] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/357972 (owner: Marostegui)
[05:56:30] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/357972 (owner: Marostegui)
[05:56:36] (PS1) Ayounsi: LibreNMS add rancid integration [puppet] - https://gerrit.wikimedia.org/r/357973 (https://phabricator.wikimedia.org/T164911)
[05:56:38] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/357972 (owner: Marostegui)
[05:57:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 - T166206 (duration: 00m 41s)
[05:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:58] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[05:58:17] (CR) Ayounsi: [C: 2] LibreNMS add rancid integration [puppet] - https://gerrit.wikimedia.org/r/357973 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi)
[05:59:55] Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3334659 (Marostegui) Please, have a plan B just in case this host doesn't come back up, it is a very old server and we know that sometimes, old servers once powered o...
[06:04:09] Operations, netops, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3334660 (ayounsi)
[06:10:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: Zayo (IPYX/125449/003/ZYO) {#11542} [10Gbps]
[06:12:43] (CR) Zhuyifei1999: "https://tools.wmflabs.org/fontcdn/ should be ready. I can't test it well without this patch due this css url 404 and I can't refresh font " [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: Zhuyifei1999)
[06:13:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[06:35:05] (PS3) Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[06:40:25] Operations, DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3334686 (Marostegui) We need to test myloader using compressed tables, as during T153743 it took almost 24 hours to load: ``` root@labsdb1009:/srv/tmp/dewiki# du -sh * 37G e...
[06:41:12] (PS4) Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[06:41:23] !log updating mw1161 to HHVM 3.18
[06:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:33] Operations, netops, Patch-For-Review: Rancid improvements - https://phabricator.wikimedia.org/T167288#3334694 (ayounsi) For reference, this is now possible: git clone ssh://netmon1001.wikimedia.org:/var/lib/rancid/core/ rancid-configs Devices also now have a "config" tab in LibreNMS.
[06:49:20] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:49:22] (PS5) Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[06:50:10] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 211 bytes in 0.076 second response time
[06:54:52] <_joe_> so thumbor is failing since yesterday and we put it as the source of images on testwiki?
[06:55:04] <_joe_> oh no it was just a PS
[06:58:22] !log updating mw117* to HHVM 3.18+wmf5
[06:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:08] (PS6) Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[07:09:10] Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3334712 (elukey) Thanks @Marostegui, I didn't think the situation was so desperate :D If there could be the risk of a bigger failure I'd change idea about the BBU an...
[07:09:48] Operations, DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3273350 (Marostegui) I know this is old but maybe this was affected by: T166344 and its performance degradation?
[07:09:56] (PS7) Giuseppe Lavagetto: role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[07:11:02] Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3334717 (Marostegui) I don't want to be pessimistic, but I have had issues with old servers in the past, so just wanted to give a heads up to make sure you guys have...
[07:11:17] (CR) Alexandros Kosiaris: [C: 1] system::role: remove leading 'role::' to avoid role-role [puppet] - https://gerrit.wikimedia.org/r/357960 (owner: Dzahn)
[07:11:32] (CR) Alexandros Kosiaris: [C: 1] fix all the "role-role" in system::roles [puppet] - https://gerrit.wikimedia.org/r/354172 (owner: Dzahn)
[07:15:03] !log deleted /etc/logrotate.d/nova-manage from labtestvirt2003 to reduce cronspam (same solution used in T132422#2679434)
[07:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:14] T132422: cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T132422
[07:15:46] Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3334728 (MoritzMuehlenhoff)
[07:15:47] Operations, HHVM: HHVM 3.18 segfault on jobrunner / string handling - https://phabricator.wikimedia.org/T165051#3334723 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff This bug can be closed, it was caused by the same underlying bug which was fixed in T165043
[07:16:33] Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3028990 (MoritzMuehlenhoff)
[07:16:35] Operations, HHVM, Upstream: HHVM: Crash in server worker - https://phabricator.wikimedia.org/T165669#3334729 (MoritzMuehlenhoff) Open>Resolved This bug can be closed, it was caused by the same underlying bug which was fixed in T165043
[07:18:20] Operations, DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3334734 (akosiaris) I doubt that. Looking at tendril for the last 72 hours (>2 days before T166344 happened again), it's clear that this is happening around the time the `make_updates` cron...
[07:19:27] Operations, DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3334735 (Marostegui) Cool - just wanted to make sure we had that in mind :-)
[07:19:44] Operations, HHVM: Switch CI tests back to HHVM 3.18 - https://phabricator.wikimedia.org/T167493#3334736 (MoritzMuehlenhoff)
[07:20:10] Operations, HHVM, Release-Engineering-Team (Kanban): Switch CI tests back to HHVM 3.18 - https://phabricator.wikimedia.org/T167493#3334752 (MoritzMuehlenhoff) p:Triage>Normal a:hashar
[07:23:37] (CR) Alexandros Kosiaris: "minor inline comment, looks good otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[07:26:09] !log run megacli -LDSetProp ADRA -LALL -aALL on analytics[1058-1068] - T166140
[07:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:18] T166140: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140
[07:40:34] !log upgrade app servers in codfw running HHVM 3.18 to +wmf5
[07:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:07] !log run megacli -LDSetProp -Direct -LALL -aALL on analytics[1058-1068] - T166140
[07:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:16] T166140: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140
[08:22:42] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review, User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3334853 (elukey) Finally the same setting across all analytics workers: ``` elukey@neodymium:~$ sudo cumin 'R:class = role::ana...
[08:40:18] Operations, ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167434#3334862 (fgiunchedi)
[08:40:20] Operations, ops-eqiad, User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3334859 (fgiunchedi)
[08:40:22] Operations, ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167426#3334863 (fgiunchedi)
[08:41:11] Operations, ops-eqiad, User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3209252 (fgiunchedi) We were getting duplicate alerts from ms-be1019 due to its hp raid check going unknown (I think). I've disabled the handler for hp raid on ms...
[08:41:52] Operations, ops-eqiad, User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3334865 (fgiunchedi)
[08:47:52] (PS1) Ayounsi: Remove Rancid passwdrd [labs/private] - https://gerrit.wikimedia.org/r/357977
[08:49:02] (PS2) Ayounsi: Remove Rancid password [labs/private] - https://gerrit.wikimedia.org/r/357977
[08:49:18] (CR) Ayounsi: [V: 2 C: 2] Remove Rancid password [labs/private] - https://gerrit.wikimedia.org/r/357977 (owner: Ayounsi)
[08:51:44] (CR) Alexandros Kosiaris: [C: 1] role::lvs::balancer: convert to role/profile (step 1) [puppet] - https://gerrit.wikimedia.org/r/357824 (owner: Giuseppe Lavagetto)
[08:51:59] (CR) Filippo Giunchedi: [C: -1] "LGTM, -1'ing since I'm merging this on Monday" [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490) (owner: Gilles)
[08:56:18] elukey: \o/ (re: write policy)
[08:56:51] jynus: are you setting raid cache policy to WB or NoCachedBadBBU typically?
[09:01:38] (CR) Alexandros Kosiaris: [C: -1] role::lvs::balancer: refactor to role/profile (step 2) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/357863 (owner: Giuseppe Lavagetto)
[09:03:33] (CR) Volans: [C: 1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357819 (owner: Faidon Liambotis)
[09:05:55] (CR) Giuseppe Lavagetto: role::lvs::balancer: refactor to role/profile (step 2) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/357863 (owner: Giuseppe Lavagetto)
[09:19:21] (CR) Alexandros Kosiaris: role::lvs::balancer: refactor to role/profile (step 2) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/357863 (owner: Giuseppe Lavagetto)
[09:21:15] Operations, netops, Patch-For-Review: Rancid improvements - https://phabricator.wikimedia.org/T167288#3334923 (ayounsi) Open>Resolved All done.
[09:22:02] (PS1) Gilles: Upgrade to 0.1.40 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/357978 (https://phabricator.wikimedia.org/T166938)
[09:23:18] Operations, netops, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3334929 (ayounsi)
[09:25:01] Operations, netops, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3250758 (ayounsi) Open>Resolved List items commented in the description. No real value for now for http basic auth, will reopen task if needed.
[09:32:37] Operations, Traffic, netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3334946 (ema) Open>Resolved Looks good!
[09:33:18] Operations, Patch-For-Review: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429#3036728 (fgiunchedi) I've finished converting ms-be stretch systems to predictable network interface names, no problems observed so far. For reference the commands: ``` sed -i 's/net.ifnam...
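The megacli runs above target hosts with bracketed range notation, `analytics[1058-1068]`, the same style accepted by ClusterShell-based tools such as cumin. The helper below, `expand_hosts`, is hypothetical and only sketches how such a pattern expands; it handles a single bracket group of numbers and ranges and is not cumin's implementation:

```python
import re

# Hypothetical helper: expands ONE bracketed group such as "analytics[1058-1068]"
# or "db[01-03,07]". Real tools (e.g. ClusterShell NodeSet) support much more.
GROUP_RE = re.compile(r"^(?P<prefix>[^\[]*)\[(?P<body>[^\]]+)\](?P<suffix>.*)$")


def expand_hosts(pattern):
    """Return the list of hostnames described by a single-group bracket pattern."""
    m = GROUP_RE.match(pattern)
    if not m:
        return [pattern]  # no bracket group: the pattern is already one host
    prefix, suffix = m.group("prefix"), m.group("suffix")
    hosts = []
    for part in m.group("body").split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            width = len(start)  # keep zero padding, e.g. [01-03] -> 01, 02, 03
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


for host in expand_hosts("analytics[1058-1068]"):
    print(host)
```

For analytics[1058-1068] this yields eleven hosts, analytics1058 through analytics1068.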
[09:55:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[09:55:40] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:59:20] (PS1) Ladsgroup: Make /entity/ redirect internal [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536)
[10:03:40] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:04:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:04:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:05:14] (CR) Volans: "Nice job! +1 for the Python, I have a couple of questions inline regarding the queries." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: Elukey)
[10:12:32] (CR) Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: Elukey)
[10:16:41] question, Are we allowed to make apache redirects internal? https://gerrit.wikimedia.org/r/#/c/357985/
[10:26:12] jynus, marostegui: any decom'ed-but-online db server I can experiment with?
[10:26:18] Dell one, that is
[10:26:26] megacli experiments is what I want to do
[10:50:10] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:50:20] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:50:20] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
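The 5xx alerts at 09:55 report the share of datapoints in the check window that exceed a threshold. The 22.22% and 11.11% figures are consistent with 2 and 1 datapoints out of a window of 9, but the window size is an inference, not something the log states. A minimal sketch of the calculation, assuming a strict "greater than" comparison and skipping null datapoints (both assumptions; the sample rates are made up):

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above `threshold`."""
    values = [v for v in datapoints if v is not None]  # Graphite uses None for gaps
    if not values:
        return 0.0
    above = sum(1 for v in values if v > threshold)
    return 100.0 * above / len(values)


# Nine hypothetical per-minute 5xx rates; two exceed the 1000/min critical threshold.
window = [310, 420, 1250, 980, 1100, 600, 540, 470, 390]
print(f"{percent_above(window, 1000.0):.2f}% of data above the critical threshold [1000.0]")
```

This prints "22.22% of data above the critical threshold [1000.0]", matching the Text alert's format. A single spike in a nine-point window is already 11.11%, which helps explain why the alert can fire and recover within minutes.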
[10:51:00] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[10:51:10] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[10:51:10] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:59:58] (CR) Filippo Giunchedi: [C: 2] Upgrade to 0.1.40 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/357978 (https://phabricator.wikimedia.org/T166938) (owner: Gilles)
[11:00:17] (PS3) Faidon Liambotis: raid: remove unused aac, twe, zfs [puppet] - https://gerrit.wikimedia.org/r/357819
[11:00:19] (PS1) Faidon Liambotis: raid: remove the option to check write cache policies [puppet] - https://gerrit.wikimedia.org/r/357992
[11:00:21] (PS1) Faidon Liambotis: raid: split parts of raid into raid::monitoring [puppet] - https://gerrit.wikimedia.org/r/357993
[11:00:23] (PS1) Faidon Liambotis: Add a new raid::policy define [puppet] - https://gerrit.wikimedia.org/r/357994
[11:00:56] elukey: ^
[11:01:22] :D
[11:02:23] wow really nice
[11:02:43] (CR) Faidon Liambotis: [C: 2] raid: remove unused aac, twe, zfs [puppet] - https://gerrit.wikimedia.org/r/357819 (owner: Faidon Liambotis)
[11:03:03] completely untested
[11:03:19] and I'm still wondering whether it should support ensure => present/absent
[11:03:37] to effectively make writeback ensure => absent -> writethrough
[11:04:10] paravoid: I'm afraid this might add a *lot* of time to ms-be puppet runs
[11:04:20] the test command
[11:04:53] <_joe_> this would probably merit a puppet resource on the long run, if we have more than one type of raid supported
[11:04:58] root@analytics1058:~# time megacli -LDInfo -LAll -aAll | grep -c '^Virtual Drive'
[11:05:01] 13
[11:05:04] real 0m0.011s
[11:05:20] HPEs... would be a different matter
[11:07:58] this is the kind of thing we could be doing during the install
[11:08:21] but jynus mentioned that he's seen props randomly set after various changes during the lifetime of the system
[11:09:33] in any case, monitoring is definitely not the place to do it IMHO :)
[11:12:45] (PS1) Ema: VCL: rate limit wikiScrape with vsthrottle [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233)
[11:14:50] sure :)
[11:18:19] (PS2) Faidon Liambotis: raid: split parts of raid into raid::monitoring [puppet] - https://gerrit.wikimedia.org/r/357993
[11:18:23] (PS2) Faidon Liambotis: Add a new raid::policy define [puppet] - https://gerrit.wikimedia.org/r/357994 (https://phabricator.wikimedia.org/T166108)
[11:18:29] (PS1) Faidon Liambotis: raid: add megacli default vs. current policy check [puppet] - https://gerrit.wikimedia.org/r/357999 (https://phabricator.wikimedia.org/T166108)
[11:19:25] (CR) jerkins-bot: [V: -1] raid: split parts of raid into raid::monitoring [puppet] - https://gerrit.wikimedia.org/r/357993 (owner: Faidon Liambotis)
[11:19:32] (CR) jerkins-bot: [V: -1] Add a new raid::policy define [puppet] - https://gerrit.wikimedia.org/r/357994 (https://phabricator.wikimedia.org/T166108) (owner: Faidon Liambotis)
[11:21:16] paravoid: You still want that host? (sorry I was having lunch)
[11:21:58] (CR) jerkins-bot: [V: -1] raid: add megacli default vs. current policy check [puppet] - https://gerrit.wikimedia.org/r/357999 (https://phabricator.wikimedia.org/T166108) (owner: Faidon Liambotis)
[11:22:06] (PS2) Faidon Liambotis: raid: add megacli default vs. current policy check [puppet] - https://gerrit.wikimedia.org/r/357999 (https://phabricator.wikimedia.org/T166108)
[11:23:05] (PS3) Faidon Liambotis: raid: split parts of raid into raid::monitoring [puppet] - https://gerrit.wikimedia.org/r/357993
[11:23:07] (PS3) Faidon Liambotis: Add a new raid::policy define [puppet] - https://gerrit.wikimedia.org/r/357994 (https://phabricator.wikimedia.org/T166108)
[11:23:42] volans: rewriting check-raid to look saner and splitting it up into three separate plugins would be a low hanging fruit if you're bored
[11:24:42] paravoid: noted! but quite full of TODOs for the moment, which priority we want to give to it? :-P
[11:24:46] ;)
[11:24:48] not high!
[11:24:56] Operations, DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3335077 (Marostegui) Looks like not having compression makes a big difference. On labsdb1011: 139G in 5 hours which is way faster than the other two hosts with dewiki (which...
[11:25:15] (PS3) Faidon Liambotis: raid: add megacli default vs. current policy check [puppet] - https://gerrit.wikimedia.org/r/357999 (https://phabricator.wikimedia.org/T166108)
[11:26:57] "would be a low hanging fruit if you're bored" ---> this is clearly an attempt to ignite volans' fix all the things gene :P :P
[11:27:33] rotfl
[11:28:26] s/if you're bored/if you're procrastinating on writing annual reviews/
[11:39:16] I'm guilty!
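The "megacli default vs. current policy check" idea from change 357999 can be illustrated by parsing `megacli -LDInfo -LAll -aAll` output and flagging virtual drives whose current cache policy differs from their default, for example a drive that has fallen back from WriteBack to WriteThrough. This is a sketch, not the code in that change. It assumes the output contains `Virtual Drive:`, `Default Cache Policy:` and `Current Cache Policy:` lines, and the sample below is abbreviated and illustrative:

```python
def cache_policy_mismatches(ldinfo_text):
    """Return [(vd, default, current)] for virtual drives whose policies differ."""
    mismatches = []
    vd = None
    default = None
    for raw in ldinfo_text.splitlines():
        line = raw.strip()
        if line.startswith("Virtual Drive:"):
            # e.g. "Virtual Drive: 0 (Target Id: 0)" -> "0"
            vd = line.split(":", 1)[1].split()[0]
            default = None
        elif line.startswith("Default Cache Policy:"):
            default = line.split(":", 1)[1].strip()
        elif line.startswith("Current Cache Policy:") and vd is not None:
            current = line.split(":", 1)[1].strip()
            if default is not None and current != default:
                mismatches.append((vd, default, current))
    return mismatches


SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Virtual Drive: 1 (Target Id: 1)
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""

for vd, default, current in cache_policy_mismatches(SAMPLE):
    print(f"VD {vd}: default '{default}' != current '{current}'")
```

Keeping this as a read-only comparison fits the point made above: monitoring should report drift, while setting the policy belongs elsewhere, such as install time or a dedicated puppet define.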
[12:16:10] PROBLEM - Host elastic1052 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[12:17:30] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:18:20] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 81865 bytes in 0.130 second response time
[12:19:30] RECOVERY - Host elastic1052 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[12:22:12] (PS1) Amire80: Sort wmgBabelMainCategory alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/358006
[12:22:14] (PS1) Amire80: Add wmgBabelMainCategory for many languages [mediawiki-config] - https://gerrit.wikimedia.org/r/358007
[12:25:43] mw2221 was caused by the HHVM restart script, we can revisit the 3 day max period when the new HHVM is fully rolled out
[12:27:47] moritzm: was it due to taking too much to restart?
[12:29:09] elukey: that happens from time to time, the restart script depools, but doesn't mark the service downtime, so if the HHVM check hits during the restart, it flags
[12:31:12] Jun 9 12:16:38 mw2221 systemd[1]: hhvm.service stop-sigterm timed out. Killing.
[12:31:15] Jun 9 12:16:38 mw2221 systemd[1]: hhvm.service: main process exited, code=killed, status=9/KIL
[12:31:23] moritzm: but this shouldn't happen right? --^
[12:31:36] <_joe_> nope
[12:33:50] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 126.14, 104.21, 83.58
[12:36:35] !log reducing high watermark on elasticsearch eqiad to rebalance shards
[12:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:50] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 55.71, 76.95, 79.31
[12:42:49] Operations, Operations-Software-Development: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#3335151 (MoritzMuehlenhoff)
[13:01:14] (PS1) Alexandros Kosiaris: compiler: Split fact collection from shipping/collation [puppet] - https://gerrit.wikimedia.org/r/358010
[13:10:40] (CR) Daniel Kinzler: [C: 1] "We want this change, and it looks good." [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[13:13:01] (CR) Daniel Kinzler: [C: -1] "Ooops, sorry, wait. I failed to read my own note on the ticket. There is a pretty big caveat to consider:" [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[13:18:35] !log upgrade thumbor to 0.1.40 - T167462
[13:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:43] T167462: Support PNG thumbnails of WebP originals in Thumbor - https://phabricator.wikimedia.org/T167462
[13:19:09] (CR) Daniel Kinzler: [C: -1] Make /entity/ redirect internal (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[13:21:56] Operations, Operations-Software-Development: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#3335219 (Volans)
[13:32:58] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/358010 (owner: Alexandros Kosiaris)
[13:44:11] (PS2) Ema: VCL: rate limit wikiScrape with vsthrottle [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233)
[13:46:37] (PS4) Alexandros Kosiaris: ores: Add twemproxy support [puppet] - https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676)
[13:49:20] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:50:53] (PS3) Ema: VCL: rate limit wikiScrape with vsthrottle [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233)
[14:00:24] (PS1) Ottomata: Update camus config for eventbus topics [puppet] - https://gerrit.wikimedia.org/r/358023
[14:00:27] (PS4) Ema: VCL: rate limit wikiScrape with vsthrottle [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233)
[14:10:11] (CR) Ottomata: [C: 2] Update camus config for eventbus topics [puppet] - https://gerrit.wikimedia.org/r/358023 (owner: Ottomata)
[14:10:15] (PS2) Ottomata: Update camus config for eventbus topics [puppet] - https://gerrit.wikimedia.org/r/358023
[14:10:18] (CR) Ottomata: [V: 2 C: 2] Update camus config for eventbus topics [puppet] - https://gerrit.wikimedia.org/r/358023 (owner: Ottomata)
[14:18:21] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[14:19:34] (PS1) Nschaaf: (in progress) Update recommendation-api module and role [puppet] - https://gerrit.wikimedia.org/r/358026 (https://phabricator.wikimedia.org/T167113)
[14:19:53] (CR) Nschaaf: [C: -1] (in progress) Update recommendation-api module and role [puppet] - https://gerrit.wikimedia.org/r/358026 (https://phabricator.wikimedia.org/T167113) (owner: Nschaaf)
[14:21:53] (CR) Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/6724/scb2001.codfw.wmnet/ points out this is mostly correct." [puppet] - https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) (owner: Alexandros Kosiaris)
[14:22:41] (CR) BBlack: VCL: rate limit wikiScrape with vsthrottle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233) (owner: Ema)
[14:23:00] PROBLEM - HHVM rendering on mw1184 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.075 second response time
[14:24:00] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 81839 bytes in 0.281 second response time
[14:24:05] (PS33) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815)
[14:27:26] (PS8) Giuseppe Lavagetto: role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[14:27:28] (PS1) Giuseppe Lavagetto: role::lvs::balancer: also manage interface tagging [puppet] - https://gerrit.wikimedia.org/r/358027
[14:27:41] (PS1) BBlack: varnish mobile redirects: allow for dashes in first label [puppet] - https://gerrit.wikimedia.org/r/358028 (https://phabricator.wikimedia.org/T167492)
[14:28:34] (CR) jerkins-bot: [V: -1] role::lvs::balancer: also manage interface tagging [puppet] - https://gerrit.wikimedia.org/r/358027 (owner: Giuseppe Lavagetto)
[14:28:58] I just realized that I might have missed something in my zk code review
[14:29:10] Operations, Traffic, Wikimedia-Apache-configuration, Mobile, Patch-For-Review: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3335419 (BBlack) I think this is because our mobile-redirect log...
[14:33:43] (PS34) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815)
[14:34:27] should be good, if anybody wants to review it --^
[14:37:59] (PS9) Mforns: Add white-list for EventLogging auto-purging [puppet] - https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850)
[14:39:13] Operations, DNS, Traffic, Wikimedia-Apache-configuration: Add an redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513#3335451 (Liuxinyu970226)
[14:39:30] Operations, DNS, Traffic, Wikimedia-Apache-configuration: Add an redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513#3335387 (Liuxinyu970226) Following T105999 tags
[14:40:32] Operations, DNS, Traffic, Wikimedia-Apache-configuration: Add an redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513#3335456 (vjudge404)
[14:41:58] (PS2) Ladsgroup: Make /entity/ redirect internal [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536)
[14:44:14] (CR) Ladsgroup: Make /entity/ redirect internal (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[14:46:30] (PS5) Ema: VCL: rate limit wikiScrape with vsthrottle [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233)
[14:50:56] (CR) BBlack: [C: +1] VCL: rate limit wikiScrape with vsthrottle [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233) (owner: Ema)
[14:52:39] (CR) Ema: [C: +2] VCL: rate limit wikiScrape with vsthrottle [puppet] - https://gerrit.wikimedia.org/r/357995 (https://phabricator.wikimedia.org/T163233) (owner: Ema)
[14:55:44] (CR) Thcipriani: "Inline comment" (1 comment) [software/gerrit] - https://gerrit.wikimedia.org/r/356484 (https://phabricator.wikimedia.org/T157414) (owner: Chad)
[14:58:59] (PS1) Giuseppe Lavagetto: cache: add monitoring of services at the SSL termination level [puppet] - https://gerrit.wikimedia.org/r/358032 (https://phabricator.wikimedia.org/T167048)
[14:59:45] (PS4) Chad: Adding scap3 config [software/gerrit] - https://gerrit.wikimedia.org/r/356484 (https://phabricator.wikimedia.org/T157414)
[14:59:58] (Abandoned) Ema: VCL: update wikiScrape regex [puppet] - https://gerrit.wikimedia.org/r/357787 (owner: Ema)
[15:03:38] (CR) Paladox: [C: -1] "This is causing failures on labs" [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[15:04:35] (CR) Paladox: [C: +1] "I wonder how will we ship the gerrit.init.d script and systemd script? It's currently blocked on reviews in puppet." [software/gerrit] - https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: Chad)
[15:05:14] (CR) Chad: "They will ship via puppet." [software/gerrit] - https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: Chad)
[15:05:42] (CR) Paladox: [C: +1] "> They will ship via puppet." [software/gerrit] - https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: Chad)
[15:09:15] (CR) Paladox: [C: -1] gerrit: let Apache proxy only listen on service IP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[15:10:14] (PS2) Giuseppe Lavagetto: cache: add monitoring of services at the SSL termination level [puppet] - https://gerrit.wikimedia.org/r/358032 (https://phabricator.wikimedia.org/T167048)
[15:11:44] <_joe_> !log upgraded python-service-checker to 0.1.2 on tegmen,einsteinium
[15:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:43] (CR) Giuseppe Lavagetto: [C: +2] cache: add monitoring of services at the SSL termination level [puppet] - https://gerrit.wikimedia.org/r/358032 (https://phabricator.wikimedia.org/T167048) (owner: Giuseppe Lavagetto)
[15:20:03] Operations, ops-eqiad, Dumps-Generation, Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3335544 (RobH) Ok, this is having an installer issue of some sort. Both systems should be identical, but one shows no root filesystem when the partitioning m...
[15:29:22] (CR) Chad: "Yes. That's fine." [software/gerrit] - https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: Chad)
[15:35:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[15:36:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[15:39:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0]
[15:40:45] (PS1) Giuseppe Lavagetto: Use assert_hostname for https urls only [software/service-checker] - https://gerrit.wikimedia.org/r/358037
[15:41:24] (CR) Daniel Kinzler: "@Ladsgroup "redirect=force" is not an apache thing. I propose to add this parameter to Special:EntityData." [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[15:41:32] (CR) Giuseppe Lavagetto: [C: +2] "/me wears brown paper bag for misreading the docs" [software/service-checker] - https://gerrit.wikimedia.org/r/358037 (owner: Giuseppe Lavagetto)
[15:41:41] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Use assert_hostname for https urls only [software/service-checker] - https://gerrit.wikimedia.org/r/358037 (owner: Giuseppe Lavagetto)
[15:42:32] (CR) Daniel Kinzler: [C: -1] "Now the CR-1 is for incorrect rewrite rules." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[15:44:37] <_joe_> !log uploaded service-checker 0.1.3
[15:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:22] <_joe_> !log installed python-service-checker 0.1.3 on einsteinium,tegmen T167048
[15:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:30] T167048: Services need external monitoring - https://phabricator.wikimedia.org/T167048
[15:48:03] Operations, Monitoring, Patch-For-Review, Services (next), and 2 others: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3335573 (Joe) both maps and restbase are now monitored at the load-balancers of the SSL terminators in all datacenters. Resolving.
[15:49:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[15:51:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[15:52:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[15:53:11] monitoring ^, looks like perhaps some bot decided to hit a smaller wiki thats not as distributed with hundreds of req/s, but its dieing down
[15:55:35] <_joe_> ebernhardson: heh
[15:56:10] <_joe_> that's impacting that wiki only or the other ones as well?
[15:56:35] Operations, Monitoring, Patch-For-Review, Services (next), and 2 others: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3335606 (GWicke) Open>Resolved Thank you, @Joe!
[15:57:37] _joe_: it looks like 3 machines in the cluster spiked from a load average of 10 to 60
[15:57:58] _joe_: which hints that its a wiki thats not well distributed, almost certainly a small one that usually sees a handful of req/s at most
[15:58:14] and the increase in full text reqs across the cluster was from 400 to 750
[15:58:30] so, most other requests should be services just fine
[16:01:20] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:01:20] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:01:20] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:02:10] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[16:02:10] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[16:02:11] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:02:19] <_joe_> 3/win 25
[16:32:42] (PS1) Cmjohnson: Merge branch 'production' of https://gerrit.wikimedia.org/r/p/operations/puppet into lold [puppet] - https://gerrit.wikimedia.org/r/358042
[16:32:57] (CR) jerkins-bot: [V: -1] Merge branch 'production' of https://gerrit.wikimedia.org/r/p/operations/puppet into lold [puppet] - https://gerrit.wikimedia.org/r/358042 (owner: Cmjohnson)
[16:33:23] (Abandoned) Cmjohnson: Merge branch 'production' of https://gerrit.wikimedia.org/r/p/operations/puppet into lold [puppet] - https://gerrit.wikimedia.org/r/358042 (owner: Cmjohnson)
[16:35:04] Operations, ops-eqiad, DBA, Patch-For-Review: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935#3335699 (Marostegui) @Cmjohnson you think you will have time for this sometime next week? Thanks!
[16:47:10] (PS12) Paladox: Upgrade gerrit to 2.14.1 (DO NOT MERGE) [debs/gerrit] - https://gerrit.wikimedia.org/r/350440
[16:48:40] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:49:21] (PS13) Paladox: Upgrade gerrit to 2.14.1 (DO NOT MERGE) [debs/gerrit] - https://gerrit.wikimedia.org/r/350440
[16:50:18] (PS14) Paladox: Upgrade gerrit to 2.14.1 (DO NOT MERGE) [debs/gerrit] - https://gerrit.wikimedia.org/r/350440
[16:55:40] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[17:10:03] (PS8) BBlack: numa_networking: add facter data from sysfs [puppet] - https://gerrit.wikimedia.org/r/355809
[17:10:05] (PS9) BBlack: numa_networking: support NUMA in interface::rps [puppet] - https://gerrit.wikimedia.org/r/355810
[17:10:07] (PS9) BBlack: numa_networking: support NUMA in tlsproxy nginx config [puppet] - https://gerrit.wikimedia.org/r/355811
[17:10:09] (PS2) BBlack: numa_networking: test enable on cp4021 [puppet] - https://gerrit.wikimedia.org/r/357844
[17:10:11] (PS2) BBlack: numa_networking: remove install-time bnx2x stuff [puppet] - https://gerrit.wikimedia.org/r/357850
[17:17:17] Hallo.
[17:17:54] (PS4) Dzahn: fix all the "role-role" in system::roles [puppet] - https://gerrit.wikimedia.org/r/354172
[17:17:57] When does https://wikitech.wikimedia.org/wiki/Deployments get the table for the next week?
[17:19:53] aharoni: I should have done that yesterday, but didn't, will later today after a 1:1 in 13 minutes
[17:20:05] greg-g: thanks!
[17:24:03] (PS1) Framawiki: Lift IP throttle for Editathon (13 June 2017) [mediawiki-config] - https://gerrit.wikimedia.org/r/358056 (https://phabricator.wikimedia.org/T167517)
[17:24:20] (PS1) Ema: [WIP] VCL: switch to resp.reason testing [puppet] - https://gerrit.wikimedia.org/r/358057
[17:24:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:24:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[17:25:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:25:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:31:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:32:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:32:50] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:33:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:36:59] (PS1) Framawiki: Add NS:100 to wgNamespacesToBeSearchedDefault for enwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/358059 (https://phabricator.wikimedia.org/T167511)
[17:44:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:47:41] aharoni: {{done}} (my 1:1 was delayed)
[17:50:58] greg-g: thanks!
[17:52:51] Operations, ops-eqiad, Dumps-Generation, Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3336035 (RobH) Both Chris and I have reviewed, and everything on these two systems appears identical. The fact we get two results from the same hardware is c...
[17:55:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:08:23] (PS1) RobH: remove dumpsdata100[12] from netboot for testing [puppet] - https://gerrit.wikimedia.org/r/358063
[18:08:33] (PS2) RobH: remove dumpsdata100[12] from netboot for testing [puppet] - https://gerrit.wikimedia.org/r/358063
[18:08:44] (CR) RobH: [C: +2] remove dumpsdata100[12] from netboot for testing [puppet] - https://gerrit.wikimedia.org/r/358063 (owner: RobH)
[18:11:57] Operations, ops-eqiad, Dumps-Generation, Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3336086 (RobH) It is odd, since i've also pulled them out of netboot.cfg entirely to test. When they are out of it, it should ALWAYS show the manual partitio...
[18:12:10] !log retry allocation of failed shards on elasticsearch eqiad
[18:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:22] (CR) BryanDavis: [C: -1] "small issue with the location matching noted inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: Zhuyifei1999)
[18:24:13] (CR) Zhuyifei1999: tools-static: add /fontcdn/ to reverse-proxy to Google Fonts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: Zhuyifei1999)
[18:25:49] (PS3) Zhuyifei1999: tools-static: add /fontcdn/ to reverse-proxy to Google Fonts [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027)
[18:40:31] (PS1) Andrew Bogott: designate.conf: Raise query a few more query limits. [puppet] - https://gerrit.wikimedia.org/r/358078
[18:42:35] (CR) Andrew Bogott: [C: +2] designate.conf: Raise query a few more query limits. [puppet] - https://gerrit.wikimedia.org/r/358078 (owner: Andrew Bogott)
[19:26:11] (CR) Dzahn: [C: +2] "at a glance this may look like a big change, but it's really just about the 'message of the day' and that "role:role" duplication in it." [puppet] - https://gerrit.wikimedia.org/r/354172 (owner: Dzahn)
[19:26:23] (PS5) Dzahn: fix all the "role-role" in system::roles [puppet] - https://gerrit.wikimedia.org/r/354172
[19:43:28] Operations, Wikimedia-Logstash, Discovery-Search (Current work), Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3336288 (debt) p:Triage>High
[19:45:29] Operations, Discovery-Search (Current work), Patch-For-Review: replace es-tool with elasticsearch-curator for standard elasticsearch operations - https://phabricator.wikimedia.org/T166154#3336294 (debt) p:Triage>High
[19:46:50] !log mw1299: running scap pull, maybe out of date?
[19:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:32] !log fermium: $ sudo /usr/local/sbin/disable_list wikino-bureaucrats (T166848)
[19:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:41] T166848: Shut down mailing list wikino-bureaucrats - https://phabricator.wikimedia.org/T166848
[19:52:50] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 121.02, 101.73, 82.34
[19:57:13] Operations, DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3336319 (Marostegui) It is inserting around 6k rows every 10 seconds for the revision dewiki table: ``` root@labsdb1009:/srv/sqldata/dewiki# strace -p 27245 -tT-f -s100000 -o...
[19:59:50] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 49.82, 75.38, 79.07
[20:05:46] (PS1) Pmiazga: Use the new wgPopupsGateway config variable instead [mediawiki-config] - https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018)
[20:08:19] (PS2) Pmiazga: Use the new wgPopupsGateway config variable instead [mediawiki-config] - https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018)
[20:09:34] (PS3) Pmiazga: Setup the new wgPopupsGateway config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018)
[20:12:19] !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/CirrusSearch/includes/Job/DeleteArchive.php: Really fix it this time (duration: 00m 43s)
[20:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:10] PROBLEM - Apache HTTP on mw1203 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.075 second response time
[20:46:10] PROBLEM - Nginx local proxy to apache on mw1203 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.155 second response time
[20:47:10] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.096 second response time
[20:47:10] RECOVERY - Nginx local proxy to apache on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.179 second response time
[20:50:47] !log mobrovac@tin Started deploy [restbase/deploy@4e5cb35] (staging): Ensure the extract field is always present in the summary response
[20:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:26] !log mobrovac@tin Finished deploy [restbase/deploy@4e5cb35] (staging): Ensure the extract field is always present in the summary response (duration: 03m 39s)
[20:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:48] !log mobrovac@tin Started deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045
[20:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:57] T167045: Preview error for several articles - https://phabricator.wikimedia.org/T167045
[20:58:07] (PS1) Andrew Bogott: Openstack: Added 'dnsleaks.py' script. [puppet] - https://gerrit.wikimedia.org/r/358124
[20:58:53] (CR) jerkins-bot: [V: -1] Openstack: Added 'dnsleaks.py' script. [puppet] - https://gerrit.wikimedia.org/r/358124 (owner: Andrew Bogott)
[20:59:30] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused
[21:00:48] (PS2) Andrew Bogott: Openstack: Added 'dnsleaks.py' script. [puppet] - https://gerrit.wikimedia.org/r/358124
[21:01:44] !log mobrovac@tin Finished deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045 (duration: 04m 57s)
[21:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:18] !log mobrovac@tin Started deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045 (take #2)
[21:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:26] T167045: Preview error for several articles - https://phabricator.wikimedia.org/T167045
[21:05:06] (PS1) Herron: Adjust wikimedia.org SPF from neutral (?all) to soft fail (~all) to impede sender address spoofing. [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191)
[21:07:41] !log mobrovac@tin Finished deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045 (take #2) (duration: 05m 23s)
[21:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:51] T167045: Preview error for several articles - https://phabricator.wikimedia.org/T167045
[21:08:46] (CR) Reedy: Adjust wikimedia.org SPF from neutral (?all) to soft fail (~all) to impede sender address spoofing. (1 comment) [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: Herron)
[21:13:31] Operations, Discovery, Maps, Interactive-Sprint: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3336458 (Gehel)
[21:16:01] Operations, Discovery, Maps, Interactive-Sprint: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3336475 (Gehel)
[21:17:09] !log mobrovac@tin Started deploy [restbase/deploy@4e5cb35]: (no justification provided)
[21:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:48] (CR) Andrew Bogott: [C: +2] Openstack: Added 'dnsleaks.py' script. [puppet] - https://gerrit.wikimedia.org/r/358124 (owner: Andrew Bogott)
[21:18:30] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15540 bytes in 0.014 second response time
[21:18:36] Operations, Discovery, Icinga, Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3336490 (Peachey88)
[21:18:49] !log mobrovac@tin Finished deploy [restbase/deploy@4e5cb35]: (no justification provided) (duration: 01m 40s)
[21:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:23] Operations, ops-esams, Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3313139 (Dzahn) importing text from (almost) duplicate ticket T166964 (merging into this ticket) ``` TASK AUTO-GENERATED by Nagios/Icinga RAID event handler A degraded RAID (md) was detected on hos...
[21:25:59] Operations, ops-esams, Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3313134 (Dzahn) Open>Invalid closing as duplicate of T166965 imported text over there
[21:26:41] Operations, DNS, Traffic: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#3336503 (Dzahn) p:Triage>Normal
[21:27:08] Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3336505 (Dzahn) p:Triage>Normal
[21:28:35] Operations, Gerrit: Upload gerrit package to stretch apt.wm.org repo - https://phabricator.wikimedia.org/T165620#3336506 (Dzahn) p:Triage>Normal
[21:31:20] Operations, ops-esams, Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3336512 (Dzahn)
[21:31:22] Operations, ops-esams, Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3336514 (Dzahn)
[21:44:24] Operations, Discovery, Icinga, Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3336539 (debt) p:Triage>High
[21:49:11] Operations, Labs, Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3336547 (Dzahn) @Andrew @MoritzMuehlenhoff Should i just close it now? I don't have anything specific to do for me, but do we still need to classify it as prod or labs?
[21:49:31] Operations, ops-eqiad, Labs, Patch-For-Review: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#3336550 (Dzahn)
[21:49:33] Operations, Labs, Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3336548 (Dzahn) Open>stalled p:Normal>Low
[22:21:19] (PS2) Reedy: added spf record to toolserver.org [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: Mschon)
[22:21:35] (CR) Reedy: "PS2 is a rebase" [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: Mschon)
[22:23:00] PROBLEM - Check systemd state on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:50] RECOVERY - Check systemd state on mw1294 is OK: OK - running: The system is fully operational
[22:29:47] (CR) Reedy: Adjust wikimedia.org SPF from neutral (?all) to soft fail (~all) to impede sender address spoofing. (1 comment) [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: Herron)
[22:41:20] Operations, Prometheus-metrics-monitoring: prometheus-node-exporter - invalid group: ‘prometheus:prometheus' - https://phabricator.wikimedia.org/T167245#3336629 (Dzahn) p:Triage>Low
[22:47:00] PROBLEM - puppet last run on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:48:00] PROBLEM - Check systemd state on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:48:50] RECOVERY - Check systemd state on mw1294 is OK: OK - running: The system is fully operational
[22:48:50] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 32 minutes ago with 0 failures
[23:07:49] (PS1) Dzahn: wikistats: cron for XML dumps (WIP) [puppet] - https://gerrit.wikimedia.org/r/358150
[23:08:09] (CR) Dzahn: [C: -2] "reminder to self.. todo" [puppet] - https://gerrit.wikimedia.org/r/358150 (owner: Dzahn)
[23:09:12] (CR) jerkins-bot: [V: -1] wikistats: cron for XML dumps (WIP) [puppet] - https://gerrit.wikimedia.org/r/358150 (owner: Dzahn)
[23:09:30] PROBLEM - MD RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:20] RECOVERY - MD RAID on ms-be1019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[23:13:07] (CR) Ladsgroup: "Facepalm" [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[23:14:31] (PS3) Ladsgroup: Make /entity/ redirect internal [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536)
[23:21:49] (CR) Jdlrobson: [C: +1] Setup the new wgPopupsGateway config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) (owner: Pmiazga)
[23:23:14] (CR) Jdlrobson: [C: +1] "I should note a follow up to this will be needed to remove the other variable." [mediawiki-config] - https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) (owner: Pmiazga)
[23:25:44] (CR) BryanDavis: [C: +1] "Untested, but it looks right. We can try it out on the inactive tools-static proxy first to make sure things work as expected." [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: Zhuyifei1999)
[23:56:28] (PS2) Dzahn: wikistats: cron for XML dumps (WIP) [puppet] - https://gerrit.wikimedia.org/r/358150
[23:57:41] (CR) jerkins-bot: [V: -1] wikistats: cron for XML dumps (WIP) [puppet] - https://gerrit.wikimedia.org/r/358150 (owner: Dzahn)
[23:58:57] (PS3) Dzahn: wikistats: cron for XML dumps (WIP) [puppet] - https://gerrit.wikimedia.org/r/358150 (https://phabricator.wikimedia.org/T165879)