[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170706T0000). [00:01:52] !log preparing to deploy phabricator release/2017-07-05/1 (Milestone: https://phabricator.wikimedia.org/project/view/2881/ ) [00:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:38] (03CR) 10Dzahn: [C: 032] "just affects the new installation that isn't in prod yet - may need some minor follow-up but definitely better than the default - thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/361190 (owner: 10Paladox) [00:03:16] (03CR) 10Paladox: "> just affects the new installation that isn't in prod yet - may need" [puppet] - 10https://gerrit.wikimedia.org/r/361190 (owner: 10Paladox) [00:04:54] !log phabricator update completed. [00:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:47] twentyafterfour it seems the phab update did not update the ext [00:05:56] since it redirects me here https://phabricator.wikimedia.org/source/mediawiki/browse/refs%2Fmeta%2Fconfig/ [00:06:05] PROBLEM - HHVM jobrunner on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [00:07:05] RECOVERY - HHVM jobrunner on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [00:07:34] by ext i mean the phab ext we use for the gerrit redirects :) [00:08:48] paladox: more worrisome, diffusion is fataling [00:08:54] https://phabricator.wikimedia.org/diffusion/PHDEP/ [00:08:55] oh [00:09:03] is not fataling for me [00:09:05] so hmm [00:09:31] i see the fatal [00:10:17] twentyafterfour /config/version/ will show you which parts got updated [00:10:47] for me it's https://phabricator.wikimedia.org/P5686 [00:13:29] twentyafterfour it did not deploy correctly it seems [00:13:40] as i can't find canHandleRequestException [00:13:45] in
https://github.com/wikimedia/phabricator [00:14:17] I don't get it [00:15:00] (03CR) 10Niharika29: "Apparently the image misalignment was only an issue in Chrome. It's been fixed in https://gerrit.wikimedia.org/r/#/c/363511/ along with th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [00:16:35] Something must be cached somewhere. [00:16:44] !log restarting apache and clearing phabricator caches [00:16:49] https://github.com/wikimedia/phabricator/commit/c71d9c601f9725451e7666d17360bd87922b0f33#diff-62aa8cf75591484fe337fcdfea26ed97 [00:16:50] yeh [00:16:51] i found that too [00:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:24] twentyafterfour did you git pull and did a submodule update? [00:17:40] paladox: yes [00:17:41] i found the submodules have to be a commit and not checkout as a branch. [00:17:48] restarting apache seems to have done the trick [00:18:46] !log diffusion fatals resolved by restarting apache and clearing phabricator's bytecode cache [00:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:36] paladox: does the redirect work now? [00:21:39] yep [00:21:40] thanks :) [00:21:41] cool [00:21:44] thank you! [00:21:55] !log phabricator deployment really finished this time. really. [00:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:13] :) [00:24:44] :) [00:27:20] (03CR) 10Dzahn: [C: 031] Remove exim4-heavy and exim4::ganglia from role requesttracker_server [puppet] - 10https://gerrit.wikimedia.org/r/363390 (https://phabricator.wikimedia.org/T169794) (owner: 10Herron) [00:27:40] (03CR) 10Jforrester: [C: 031] "Good to merge now (assuming deploys to other wikis won't be until after the wmf.8 train runs next week)?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [00:29:52] (03CR) 10Niharika29: "Yes, deploys to other wikis aren't planned for now. We'll have this in testwiki only for another two weeks by my estimate. I'll schedule t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [00:39:15] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [00:40:55] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.73 ms [00:46:56] !log labcontrol1002 has multiple IPs, 208.80.154.102 (no DNS name) and 208.80.154.12 (labservices1002). labservices1002 is another host that ALSO has the 208.80.154.12 IP and 208.80.154.20 (lab-recursor1). Can the duplicate IP be removed from one of them? T169039 [00:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:15] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Site-requests: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822#3409698 (10Dcljr) Wow, that was fast. 
I thought it would take hours… [00:56:29] to me it seems wikidata is down, causing the whole of wikipedia to go down, before panicking I'm trying again [00:56:46] wikidata is up [00:56:52] but I can't load english wikipedia [00:57:10] Amir1: english wikipedia works for me [00:57:23] okay, it's just me [00:57:31] I can open everything else though [01:00:47] okay, it's just me [01:02:15] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:15] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499302929 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9154215 keys, up 2 minutes 8 seconds - replication_delay is 1499302929 [01:02:15] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499302929 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9154171 keys, up 2 minutes 8 seconds - replication_delay is 1499302929 [01:02:25] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
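[Annotation] The Phabricator fatals earlier in the log were diagnosed by looking for canHandleRequestException in the deployed tree, and resolved by clearing the bytecode cache and restarting apache. That kind of post-deploy sanity check — "did the commit actually land on disk?" — can be sketched as below. The symbol name is from the log; the helper itself is a hypothetical illustration, not a WMF deploy tool.

```python
import os

def symbol_in_tree(root, symbol, exts=(".php",)):
    """Return the files under `root` that mention `symbol`.

    A quick post-deploy check that a commit actually landed on disk.
    If the symbol IS on disk but requests still fatal, the interpreter
    is likely serving stale bytecode (for HHVM, restarting the service
    clears it, which is what resolved the fatals above)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    if symbol in fh.read():
                        hits.append(path)
            except OSError:
                continue  # unreadable file; skip rather than abort the scan
    return hits
```

An empty result together with a clean `git log` is the tell that the problem is a cache, not the checkout — exactly the situation in the log, where the submodules also had to be pinned to a commit rather than checked out as a branch.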
[01:02:25] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1499302943 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9160861 keys, up 2 minutes 21 seconds - replication_delay is 1499302943 [01:03:05] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4450667 keys, up 3 minutes - replication_delay is 0 [01:03:15] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9153679 keys, up 3 minutes 12 seconds - replication_delay is 0 [01:03:15] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9150591 keys, up 3 minutes 12 seconds - replication_delay is 0 [01:03:15] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9058401 keys, up 3 minutes 11 seconds - replication_delay is 0 [01:03:35] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9156139 keys, up 3 minutes 27 seconds - replication_delay is 0 [01:23:34] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Site-requests: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822#3410222 (10demon) Weird, it was taking way longer for me earlier... 
Glad we got it done :) [02:29:36] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3410257 (10herron) sounds good 👍 [02:31:05] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 09m 25s) [02:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:33] (03PS1) 10Krinkle: Archive old "wikimedia-job-runner" repo [debs/wikimedia-job-runner] - 10https://gerrit.wikimedia.org/r/363517 [02:37:44] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jul 6 02:37:44 UTC 2017 (duration 6m 39s) [02:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:52] (03CR) 10Krinkle: [C: 032] Archive old "wikimedia-job-runner" repo [debs/wikimedia-job-runner] - 10https://gerrit.wikimedia.org/r/363517 (owner: 10Krinkle) [02:38:58] (03CR) 10jerkins-bot: [V: 04-1] Archive old "wikimedia-job-runner" repo [debs/wikimedia-job-runner] - 10https://gerrit.wikimedia.org/r/363517 (owner: 10Krinkle) [02:39:13] (03CR) 10Krinkle: [V: 032 C: 032] Archive old "wikimedia-job-runner" repo [debs/wikimedia-job-runner] - 10https://gerrit.wikimedia.org/r/363517 (owner: 10Krinkle) [03:32:25] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [03:33:25] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [04:13:20] (03CR) 10Liuxinyu970226: [C: 04-1] "Also, shouldn't we make some unit tests?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/363370 (https://phabricator.wikimedia.org/T168727) (owner: 10MarcoAurelio) [04:56:51] !log Deploy alter table on s1 eqiad hosts - T168661 [04:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:03] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [04:59:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363522 (https://phabricator.wikimedia.org/T168661) [05:00:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363522 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:01:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363522 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:02:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363522 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:02:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T168661 (duration: 00m 43s) [05:03:05] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 55 [05:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:10] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:09:05] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363524 [05:10:30] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363524 (owner: 10Marostegui) [05:11:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363524 (owner: 
10Marostegui) [05:11:43] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363524 (owner: 10Marostegui) [05:12:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 - T168661 (duration: 00m 43s) [05:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:59] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:13:14] (03PS2) 10Marostegui: db2056.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/362991 (https://phabricator.wikimedia.org/T148507) [05:16:09] !log Stop mysql on db2056 for maintenance - T148507 T169510 [05:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:21] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510 [05:18:20] (03CR) 10Marostegui: [C: 032] db2056.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/362991 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui) [05:20:20] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:20:20] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:22:23] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
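[Annotation] The redis health checks earlier in the log flap with lines like "replication_delay is 1499302929 600": a replica that has just restarted reports a huge delay value (it reads like a raw unix timestamp) until it re-syncs, at which point the delay drops to 0 and the check recovers. How such a line compares against the 600-second threshold can be sketched as below — the parsing is an assumption based on the visible message format, not the actual check plugin's code.

```python
import re

CRIT_THRESHOLD = 600  # seconds, as shown in the alerts above

def parse_replication_delay(status_line):
    """Pull the first replication_delay value out of an Icinga-style
    redis status line (format assumed from the log above)."""
    m = re.search(r"replication_delay is (\d+)", status_line)
    return int(m.group(1)) if m else None

def is_critical(delay, threshold=CRIT_THRESHOLD):
    # A freshly restarted replica can report an enormous value until
    # it catches up with its master; missing data is treated as bad.
    return delay is None or delay > threshold
```

This also explains why the PROBLEMs above self-resolve within a minute: once the restarted instance finishes syncing, replication_delay is 0 and `is_critical` is false again.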
[05:22:23] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [05:22:23] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:22:24] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [05:31:25] (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363526 (https://phabricator.wikimedia.org/T168661) [05:38:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363526 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:39:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363526 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:39:58] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363526 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:40:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T168661 (duration: 00m 42s) [05:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:08] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:47:00] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363527 [05:48:24] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363527 (owner: 10Marostegui) [05:49:38] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363527 (owner: 10Marostegui) [05:49:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/363527 (owner: 10Marostegui) [05:51:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 - T168661 (duration: 00m 42s) [05:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:45] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:03:54] (03PS1) 10Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363528 (https://phabricator.wikimedia.org/T168661) [06:06:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363528 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:07:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363528 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:07:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363528 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:08:23] (03PS1) 10ArielGlenn: get rid of mirrors user, run all dataset rsyncs as datasets user [puppet] - 10https://gerrit.wikimedia.org/r/363529 [06:08:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1080 - T168661 (duration: 00m 42s) [06:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:09] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:10:36] (03CR) 10ArielGlenn: [C: 032] get rid of mirrors user, run all dataset rsyncs as datasets user [puppet] - 10https://gerrit.wikimedia.org/r/363529 (owner: 10ArielGlenn) [06:14:07] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363530 [06:15:34] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/363530 (owner: 10Marostegui) [06:18:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363530 (owner: 10Marostegui) [06:18:34] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363530 (owner: 10Marostegui) [06:19:25] !log marostegui@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [06:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:13] Another normal run looked fine [06:20:16] (03PS1) 10Legoktm: Check that NS_MODULE is defined before using it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363531 [06:20:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080 - T168661 (duration: 00m 42s) [06:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:27] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:26:56] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363532 (https://phabricator.wikimedia.org/T168661) [06:28:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363532 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:30:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363532 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:30:14] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363532 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:30:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool 
db1083 - T168661 (duration: 00m 42s) [06:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:06] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:34:45] (03PS1) 10Marostegui: db-eqiad.php: Repool db1083, depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363533 (https://phabricator.wikimedia.org/T168661) [06:36:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1083, depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363533 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:36:25] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [06:36:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [06:37:24] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1083, depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363533 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:37:33] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1083, depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363533 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:38:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083, depool db1089 - T168661 (duration: 00m 43s) [06:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:50] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:38:56] 10Operations, 10Dumps-Generation: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3410553 (10ArielGlenn) [06:39:45] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/363390 (https://phabricator.wikimedia.org/T169794) (owner: 10Herron) [06:41:06] (03PS1) 10Marostegui: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363534 (https://phabricator.wikimedia.org/T168661) [06:41:25] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:41:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:42:12] !log rebooting wtp1* for kernel update [06:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363534 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:44:45] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363534 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:45:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T168661 (duration: 00m 44s) [06:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:50] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:45:58] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363534 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:46:57] (03PS1) 10Muehlenhoff: Remove expiry date for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/363535 [07:11:11] !log Disable puppet on dbstore2002 - T169510 [07:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:21] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510 [07:15:28] !log Stop MySQL on dbstore2002 for maintenance - 
T169510 [07:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:14] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3410598 (10elukey) Another thing that would be nice is the possibility to specify more than one conf host in `profile::pybal::config_host: conf2001.codfw.wmnet`, and allow pybal to conne... [07:24:48] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3407910 (10MoritzMuehlenhoff) >>! In T169765#3410598, @elukey wrote: > Another thing that would be nice is the possibility to specify more than one conf host in `profile::pybal::config_h... [07:26:50] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3407910 (10Volans) >>! In T169765#3410598, @elukey wrote: > Another thing that would be nice is the possibility to specify more than one conf host in `profile::pybal::config_host: conf20... [07:31:45] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3410629 (10Joe) One option to support reconnections and srv records and everything is to use the (blocking) python-etcd library via `defer.deferToThread` as `etcd-mirror` does. The issu... [07:42:04] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3410634 (10jcrespo) I would do labsdb1004 first, which is the slave for the toolsdb, and labsdb1005- I didn't want to pressure you because I knew you had other concerns. I would say Tue... 
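[Annotation] The cycle repeated above — commit a db-eqiad.php change that depools one replica, `scap sync` it, run the ALTER TABLE, then revert and repool — boils down to removing and restoring one host in a load map. A toy model of that state change follows; the real wmf-config uses PHP arrays, so the dict shape here is an illustrative assumption, not the actual file format.

```python
def depool(loads, host):
    """Return a new load map with `host` removed from rotation."""
    if host not in loads:
        raise KeyError(f"{host} is not pooled")
    return {h: w for h, w in loads.items() if h != host}

def repool(loads, host, weight):
    """Return a new load map with `host` restored at `weight`."""
    new = dict(loads)
    new[host] = weight
    return new
```

Keeping the depool as its own commit is what makes the repool in the log a plain `Revert "db-eqiad.php: Depool dbNNNN"` — the restore is guaranteed to match the original state.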
[07:42:19] !log reboot wasat for kernel update [07:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:32] (03CR) 10Jcrespo: [C: 032] mariadb: Revert parsercaches to pc100[456] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363375 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [07:47:10] (03Merged) 10jenkins-bot: mariadb: Revert parsercaches to pc100[456] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363375 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [07:47:19] (03CR) 10jenkins-bot: mariadb: Revert parsercaches to pc100[456] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363375 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [07:48:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Revert parsercaches to pc100[456] (duration: 00m 43s) [07:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:11] !log rebooting restbase1013 for kernel update [08:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:22] (03PS1) 10Jcrespo: mariadb: Retire db1096, db1099 and db1101 from the parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) [08:12:45] (03PS2) 10Jcrespo: mariadb: Retire db1096, db1099 and db1101 from the parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) [08:12:47] (03CR) 10Marostegui: mariadb: Retire db1096, db1099 and db1101 from the parsercache role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:15:04] (03CR) 10Jcrespo: "Where?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:16:05] (03CR) 10Elukey: "Got a notification for this one, but this patch is not applicable anymore to the class since it is changed too much." [puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh) [08:16:25] (03CR) 10Marostegui: mariadb: Retire db1096, db1099 and db1101 from the parsercache role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:16:29] (03CR) 10Elukey: "> Got a notification for this one, but this patch is not applicable" [puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh) [08:17:11] (03PS1) 10ArielGlenn: monitor dataset hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 [08:18:17] (03CR) 10jerkins-bot: [V: 04-1] monitor dataset hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (owner: 10ArielGlenn) [08:18:23] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3410699 (10ArielGlenn) As volans rightly points out, we should be alerting for nfs lockups of this sort on the dataset servers. First take on a patchset h... [08:18:32] (03CR) 10Jcrespo: "Blame gerrit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:19:29] (03CR) 10Marostegui: [C: 031] "Shame on you gerrit!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:22:10] (03CR) 10Jcrespo: [C: 032] mariadb: Retire db1096, db1099 and db1101 from the parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/363546 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:30:55] PROBLEM - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.135 and port 9042: Connection refused [08:31:45] PROBLEM - cassandra-a SSL 10.64.48.135:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:33:15] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused [08:33:25] PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:34:05] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused [08:34:17] uptime 253 days, so not rebooted --^ [08:34:19] checking [08:34:24] moritzm: I assume this is your reboot of restbase1013 [08:34:25] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3410711 (10jcrespo) [08:34:45] PROBLEM - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:35:13] ah no it was drained! [08:35:20] sorry, expired downtime, it's drained [08:35:30] okok! 
It got me worried for a bit :) [08:35:30] reboot forthcoming shortly [08:36:34] !log rebooting restbase1014 for kernel update [08:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:40] 10Operations, 10Dumps-Generation: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3410740 (10ArielGlenn) The moving parts are as follows: dumpsdata: - filesystem definition, nfs export to snapshot hosts - clean up of old dump run output - save completed revi... [08:50:55] RECOVERY - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-b valid until 2017-09-12 15:34:28 +0000 (expires in 68 days) [08:51:05] RECOVERY - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.137 port 9042 [08:51:13] I like that kind of ping [08:52:59] !log rebooting restbase-test cluster for kernel updates [08:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:35] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:56:35] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [09:00:47] mmmmm [09:01:24] Jul 6 08:55:35 mw1300 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV [09:01:27] moritzm: --^ [09:03:07] yeah, I already looked at the segfault, nothing usable, memory garbled in the corefile [09:04:19] yep checked the stacktrace, really weird [09:04:33] moritzm: where is the corefile ? 
[09:05:01] ah should be in /var/log/hhvm [09:05:14] so I guess stacktrace.964.log [09:05:19] (03CR) 10Filippo Giunchedi: [C: 04-1] use 'require_package' for stats packages including python-yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363382 (owner: 10ArielGlenn) [09:10:52] it's in /var/tmp/core/ [09:12:39] argh, didn't know it [09:12:49] I checked the hhvm config and found only var log [09:12:50] thanks! [09:13:09] it's configured system-wide for all our corefiles, not specific to hhvm [09:13:19] got it [09:16:34] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3410781 (10jcrespo) I have checked puppet, and I do not see any error with the puppet configuration (ip, mac of the new hosts). @ayounsi do you have time to help us check the network config... [09:16:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:16:35] (03PS1) 10Gehel: maps - postgis-vt SQL lib has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/363555 [09:17:34] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:17:41] <_joe_> we just had a peak of 503s [09:18:29] it has been happening for the past two days, one big spike with varnish backend fetch failures in codfw (causing uslfo issues too) [09:18:41] indeed, five minutes ago [09:18:47] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3410793 (10jcrespo) db1098, for example, should have IP `10.64.16.83` and mac `18:66:DA:F8:D5:E0` according to the server and puppet configuration, but PXE doesn't move forward with the ins... 
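[Annotation] The graphite alerts in this log ("CRITICAL: 22.22% of data above the critical threshold [1000.0]", and 11.11% earlier) come from comparing recent datapoints against a threshold and alerting when too large a fraction exceeds it. A simplified sketch of that calculation is below; the thresholds are taken from the log, but the percentage cutoff and the null handling are assumptions — the real check's internals are not shown here.

```python
def percent_above(datapoints, threshold):
    """Fraction (in percent) of datapoints strictly above threshold.

    None entries (missing graphite data) are ignored — an assumption
    about how the real check treats nulls."""
    vals = [v for v in datapoints if v is not None]
    if not vals:
        return 0.0
    return 100.0 * sum(v > threshold for v in vals) / len(vals)

def check_5xx(datapoints, crit=1000.0, crit_pct=10.0):
    # crit_pct is hypothetical; the alert's actual cutoff is not
    # visible in the log, only that 11.11% already fired CRITICAL.
    return "CRITICAL" if percent_above(datapoints, crit) >= crit_pct else "OK"
```

With nine datapoints per window, one point over threshold gives the 11.11% seen at 06:36, and two give the 22.22% seen at 09:16 — which is why a single short 503 spike is enough to flip Text/Ulsfo to CRITICAL and back.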
[09:18:54] <_joe_> elukey: yup [09:20:01] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3410794 (10jcrespo) ``` Link Status ``` This could be a physical issue or a network configuration issue, could you help us check? [09:20:06] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text [09:20:21] happened in text/upload/misc afaics [09:21:57] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3410795 (10jcrespo) Oh, I think I have it `18:66:DA:F8:D5:E1` says connected. I think we used the wrong port to configure the server. This may still need network check, maybe? but most like... [09:25:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:25:44] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused [09:25:54] PROBLEM - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:26:18] *sigh*, silencing again [09:26:34] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:29:13] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3410797 (10fgiunchedi) thanks @Gilles for the debugging! I think it is due to me moving some swift settings from wikitech Hiera: page to horizon, I've put the l... 
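[editor's note] The db1098 fix above hinged on `18:66:DA:F8:D5:E1` (the adjacent NIC port) showing link instead of the configured `18:66:DA:F8:D5:E0`: multi-port NICs usually get consecutive MAC addresses. A hedged sketch — `next_mac` is an illustrative helper, not a tool used in this incident:

```python
# On multi-port NICs, ports commonly receive consecutive MAC
# addresses, so the sibling port of 18:66:DA:F8:D5:E0 would be
# ...:E1 -- which is the one that showed link here. Illustrative
# helper only, not part of the install_server tooling.

def next_mac(mac: str) -> str:
    """Return the MAC address one greater than `mac` (48-bit wraparound)."""
    value = int(mac.replace(":", ""), 16)
    value = (value + 1) % (1 << 48)
    raw = f"{value:012X}"
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))

if __name__ == "__main__":
    print(next_mac("18:66:DA:F8:D5:E0"))  # 18:66:DA:F8:D5:E1
```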
[09:29:24] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=501.00 Read Requests/Sec=1549.00 Write Requests/Sec=3.80 KBytes Read/Sec=45496.40 KBytes_Written/Sec=1594.40 [09:32:12] (03PS1) 10Jcrespo: install_server: Change db1098 MAC address to the one that shows link [puppet] - 10https://gerrit.wikimedia.org/r/363563 (https://phabricator.wikimedia.org/T162233) [09:32:49] (03CR) 10Marostegui: [C: 031] install_server: Change db1098 MAC address to the one that shows link [puppet] - 10https://gerrit.wikimedia.org/r/363563 (https://phabricator.wikimedia.org/T162233) (owner: 10Jcrespo) [09:32:55] (03PS2) 10Jcrespo: install_server: Change db1098 MAC address to the one that shows link [puppet] - 10https://gerrit.wikimedia.org/r/363563 (https://phabricator.wikimedia.org/T162233) [09:34:26] (03CR) 10Jcrespo: [C: 032] install_server: Change db1098 MAC address to the one that shows link [puppet] - 10https://gerrit.wikimedia.org/r/363563 (https://phabricator.wikimedia.org/T162233) (owner: 10Jcrespo) [09:36:33] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Complete stretch reimage for ms-fe / ms-be fleet - https://phabricator.wikimedia.org/T169601#3410831 (10fgiunchedi) codfw fully reinstalled with stretch, starting eqiad [09:37:34] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=27.90 Read Requests/Sec=7.10 Write Requests/Sec=2.90 KBytes Read/Sec=28.80 KBytes_Written/Sec=83.60 [09:38:14] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156297 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1098.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-re... [09:41:35] (03CR) 10Gehel: "fwiw, here is my opinion, since it was requested..." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [09:46:42] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3410856 (10jcrespo) Sadly, I still cannot see it booting. [09:51:07] 10Operations, 10Graphite, 10User-fgiunchedi: Something puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#3410880 (10fgiunchedi) [09:51:10] 10Operations, 10Cloud-Services, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3410877 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I've deleted the `instances` directory for real from graphite machines,... [09:51:31] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3410893 (10jcrespo) @Cmjohnson this is not urgent, but can you check the link of the initially configured device? `18:66:DA:F8:D5:E0` aka network card1. @ayounsi can see link, but cannot se... [09:58:38] (03PS1) 10Jcrespo: Revert "install_server: Change db1098 MAC address to the one that shows link" [puppet] - 10https://gerrit.wikimedia.org/r/363565 [10:00:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363567 (https://phabricator.wikimedia.org/T166204) [10:07:35] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#3410932 (10fgiunchedi) [10:07:38] 10Operations, 10Cassandra, 10Patch-For-Review, 10Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#3410928 (10fgiunchedi) 05Open>03Resolved It is indeed, resolving! 
[10:18:22] !log rebooting ocg1003 for kernel update [10:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:35] (03PS3) 10Lucas Werkmeister (WMDE): Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) [10:25:34] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89955.15 seconds [10:34:01] !log rebooting ocg1001/1002 for kernel update [10:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:18] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1499337489996-37350},error:{message:Status check failed (redis failure?)}} - 232 bytes in 0.005 second response time [10:38:32] page [10:38:34] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: http status 500 [10:38:34] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! [10:38:34] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: connection error: HTTPConnectionPool(host=localhost, port=8000): Read timed out. (read timeout=5) [10:38:41] moritzm: expected? [10:38:44] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down! [10:38:44] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down! [10:38:54] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! 
[10:39:06] I only rebooted ocg1002 (and it's depooled, having a look) [10:39:34] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.67 seconds [10:39:47] <_joe_> moritzm: you need to switch the redis host if you reboot ocg1002 [10:39:54] <_joe_> since we migrated redis there [10:40:14] and it's all documented right _joe_? :-P [10:40:18] <_joe_> because it was giving us issues before the switchover [10:40:18] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 468 bytes in 0.005 second response time [10:40:19] * volans hides [10:40:34] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 143516 msg: ocg_render_job_queue 0 msg [10:40:34] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 143516 msg: ocg_render_job_queue 0 msg [10:40:34] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [10:40:44] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [10:40:44] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [10:40:49] <_joe_> volans: do you mean if I thought about service restarts when I moved redis at 10 PM? no, I didn't [10:40:49] elukey: can you downtime db1047? [10:40:54] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [10:41:21] marostegui: sure, I started a bit alter and it is not liking it sigh [10:41:27] lovely, will add a note to the Service restarts page, but fortunately we'll never need to remember that... 
[10:41:40] <_joe_> volans: otoh if this wasn't an abandonware it would already be using nutcracker [10:41:57] <_joe_> but we had marko fix the redis connection on a volunteer basis 1 week ago [10:42:08] ack [10:42:11] <_joe_> so the original idea was we would not need to know [10:42:24] <_joe_> :) [10:42:35] <_joe_> but then, ocg :P [10:42:41] eheheh [10:44:51] btw I got 1 page for crit and 2 for recovery... [10:45:08] same text, 3 minutes delay between them [10:49:17] and finally ocg1001... [10:52:55] !log reboot conf2002 for kernel update [10:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:56] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 27.71 seconds [11:03:15] <_joe_> elukey: did you downtime the host? [11:03:19] <_joe_> else it's gonna page [11:03:51] yeah, it's downtimed [11:04:04] _joe_ yes I do it before every restart [11:04:17] it is already up and running [11:04:29] etcdmirror seems fine [11:05:00] (03PS1) 10ArielGlenn: increase min_free_kbytes on dataset hosts permanently [puppet] - 10https://gerrit.wikimedia.org/r/363574 (https://phabricator.wikimedia.org/T169680) [11:07:12] !log rebooting restbase1017 for kernel update [11:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:28] (03PS4) 10Ayounsi: Add diffscan module. 
[puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) [11:14:58] <_joe_> elukey: great [11:15:36] !log reboot conf2003 for kernel updates [11:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:05] this one might cause pybal in ulsfo to be sad, we'll check right after [11:16:30] * ema stares at ulsfo LVSs [11:16:30] (03PS2) 10ArielGlenn: increase min_free_kbytes on dataset hosts permanently [puppet] - 10https://gerrit.wikimedia.org/r/363574 (https://phabricator.wikimedia.org/T169680) [11:17:26] <_joe_> ema: I was about to :P [11:18:42] (03CR) 10ArielGlenn: [C: 032] increase min_free_kbytes on dataset hosts permanently [puppet] - 10https://gerrit.wikimedia.org/r/363574 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [11:21:37] interesting, this time they did reconnect on their own [11:26:20] ema: will do conf1* after lunch ok? [11:26:40] elukey: perfect, thanks! [11:34:30] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [11:36:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [11:37:04] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3411179 (10Mvolz) [11:37:43] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3257075 (10Mvolz) a:05Mvolz>03None [11:38:52] (03PS1) 10Filippo Giunchedi: Deploy statsv with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) [11:39:32] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - 
https://phabricator.wikimedia.org/T169765#3411195 (10ema) Today @elukey took care of rebooting conf2003, and the pybals using it (ulsfo) **did** reconnect automatically. I've observed the situation a bit more closely on lvs4004... [11:41:42] (03PS1) 10Muehlenhoff: Enable base::firewall for diadem/sysprosium [puppet] - 10https://gerrit.wikimedia.org/r/363580 [11:42:41] (03CR) 10Filippo Giunchedi: "repo part at https://gerrit.wikimedia.org/r/#/c/363578" [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) (owner: 10Filippo Giunchedi) [11:42:45] (03CR) 10Ayounsi: [C: 031] Enable base::firewall for diadem/sysprosium [puppet] - 10https://gerrit.wikimedia.org/r/363580 (owner: 10Muehlenhoff) [11:47:20] (03PS2) 10Muehlenhoff: Enable base::firewall for diadem/sysprosium [puppet] - 10https://gerrit.wikimedia.org/r/363580 [11:47:38] (03PS2) 10Gehel: maps - postgis-vt SQL lib has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/363555 [11:48:03] 10Operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#3411211 (10fgiunchedi) [11:48:06] 10Operations, 10User-fgiunchedi: Point swiftrepl to swift HTTPS - https://phabricator.wikimedia.org/T161717#3411207 (10fgiunchedi) 05Open>03Resolved This is done, albeit a hack, but see {T162123} about swiftrepl in general [11:48:16] (03CR) 10Alexandros Kosiaris: [C: 032] Enable base::firewall for diadem/sysprosium [puppet] - 10https://gerrit.wikimedia.org/r/363580 (owner: 10Muehlenhoff) [11:48:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Enable base::firewall for diadem/sysprosium [puppet] - 10https://gerrit.wikimedia.org/r/363580 (owner: 10Muehlenhoff) [11:48:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Nice, these were introduced yesterday. I wonder if we are finally ready to include base::firewall in standard."
[puppet] - 10https://gerrit.wikimedia.org/r/363580 (owner: 10Muehlenhoff) [11:50:13] (03PS3) 10Gehel: maps - postgis-vt SQL lib has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/363555 [11:50:15] (03PS7) 10Alexandros Kosiaris: Create hourly backup schedule, modeled on weekly and use for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/341371 (owner: 10Chad) [11:50:19] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Create hourly backup schedule, modeled on weekly and use for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/341371 (owner: 10Chad) [11:52:45] 10Operations, 10ops-eqiad: mgmt inaccessible on restbase1018 - https://phabricator.wikimedia.org/T169871#3411220 (10MoritzMuehlenhoff) [11:55:22] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: Enable apache exporter [puppet] - 10https://gerrit.wikimedia.org/r/363347 (owner: 10Alexandros Kosiaris) [11:55:32] (03PS2) 10Alexandros Kosiaris: otrs: Enable apache exporter [puppet] - 10https://gerrit.wikimedia.org/r/363347 [11:55:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] otrs: Enable apache exporter [puppet] - 10https://gerrit.wikimedia.org/r/363347 (owner: 10Alexandros Kosiaris) [11:56:30] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [11:59:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [12:00:47] (03CR) 10Paladox: "This causes puppet to fail for me on labs" [puppet] - 10https://gerrit.wikimedia.org/r/341371 (owner: 10Chad) [12:04:31] (03CR) 10Paladox: Create hourly backup schedule, modeled on weekly and use for Gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341371 (owner: 10Chad) [12:08:00] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:09:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363567 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [12:12:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363567 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [12:13:08] (03CR) 10Muehlenhoff: "Looks good, but how about using" [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [12:13:42] (03PS1) 10Alexandros Kosiaris: backup::set: Add a jobdefaults parameter [puppet] - 10https://gerrit.wikimedia.org/r/363581 [12:13:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363567 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [12:14:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 - T166204 (duration: 00m 44s) [12:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:12] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [12:14:13] !log Deploy alter table db1083 - https://phabricator.wikimedia.org/T166204 [12:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] backup::set: Add a jobdefaults parameter [puppet] - 10https://gerrit.wikimedia.org/r/363581 (owner: 10Alexandros Kosiaris) [12:15:42] (03CR) 10Alexandros Kosiaris: "Fixed in https://gerrit.wikimedia.org/r/#/c/363581/" [puppet] - 10https://gerrit.wikimedia.org/r/341371 (owner: 10Chad) [12:17:10] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [12:20:28] !log reboot lithium for kernel update [12:20:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:50] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [12:22:00] PROBLEM - Check systemd state on helium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:28:46] (03PS1) 10Alexandros Kosiaris: backup: Use profile instead of role [puppet] - 10https://gerrit.wikimedia.org/r/363583 [12:30:08] (03CR) 10Jcrespo: "No problem, but could we add another global with MYSQL_ROOT_CLIENTS? It is a different profiles (mariadb::client) used, we do not use cumi" [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [12:30:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] backup: Use profile instead of role [puppet] - 10https://gerrit.wikimedia.org/r/363583 (owner: 10Alexandros Kosiaris) [12:30:30] (03PS2) 10Alexandros Kosiaris: ores: Use nutcracker for redis [puppet] - 10https://gerrit.wikimedia.org/r/363340 (https://phabricator.wikimedia.org/T122676) [12:30:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores: Use nutcracker for redis [puppet] - 10https://gerrit.wikimedia.org/r/363340 (https://phabricator.wikimedia.org/T122676) (owner: 10Alexandros Kosiaris) [12:31:46] (03CR) 10Jcrespo: "We could even add db1011 to that variable- technically is a root client." [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [12:33:24] (03PS4) 10Gehel: maps - postgis-vt SQL lib has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/363555 [12:34:28] (03CR) 10Gehel: [C: 032] maps - postgis-vt SQL lib has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/363555 (owner: 10Gehel) [12:34:50] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[bacula-director] [12:37:22] (03CR) 10Jcrespo: "Where is $CUMIN_MASTERS defined?, I cannot grep it" [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [12:39:10] (03CR) 10Muehlenhoff: "The general approach looks good. Some initial comments inline." (034 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [12:41:13] (03CR) 10Muehlenhoff: "$CUMIN_MASTERS is already used in the base ferm rules in the ssh-from-cumin-masters ferm service. It's defined in the generic network defi" [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [12:42:56] (03CR) 10Muehlenhoff: "Adding another macro should be fine, similar to what's used for cumin_masters in constants.pp" [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [12:46:42] (03CR) 10Herron: [C: 032] Change lists.wikimedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/361501 (https://phabricator.wikimedia.org/T167703) (owner: 10Herron) [12:46:58] (03PS2) 10Herron: Change lists.wikimedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/361501 (https://phabricator.wikimedia.org/T167703) [12:48:43] !log rebooting mc* servers in codfw for kernel update [12:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:05] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2058515 [12:54:10] (03PS1) 10Alexandros Kosiaris: Use $real_jobdefaults instead of $jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/363587 [12:54:27] (03CR) 10Gehel: Deploy mjolnir kafka daemon to relforge (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [12:54:31] (03CR) 10Gehel: [C: 04-1] Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 
10EBernhardson) [12:56:15] PROBLEM - IPsec on mc1019 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2019_v4 [12:56:23] (03CR) 10Jcrespo: [C: 032] mariadb: Correct systemd unit path [puppet] - 10https://gerrit.wikimedia.org/r/363360 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [12:56:27] (03PS2) 10Jcrespo: mariadb: Correct systemd unit path [puppet] - 10https://gerrit.wikimedia.org/r/363360 (https://phabricator.wikimedia.org/T169514) [12:58:49] !log changed lists.wikimedia.org spf to soft fail (~all) - T167703 [12:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170706T1300). [13:05:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [13:05:48] no patches [13:06:15] RECOVERY - IPsec on mc1019 is OK: Strongswan OK - 1 ESP OK [13:09:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [13:09:39] what are those mw exceptions? 
[13:09:50] database issue [13:09:58] I am looking at logstash [13:10:05] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [13:10:08] there is a spike of ChangeNotification jobs that failed to reach a DB apparently [13:10:43] https://logstash.wikimedia.org/goto/aa79134c52c024cfdb10697106a6c3cc [13:11:00] most from mw1161 , some from mw1166 [13:11:05] RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [13:11:29] (03PS4) 10Jcrespo: mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 [13:11:32] Cannot access the database: Unknown error (10.64.16.102) [13:13:07] I am assuming the ChangeNotification jobs bailed out for like 10 minutes and are reenqueued/reprocessed [13:14:47] so the culprit seems to be db1084.eqiad.wmnet. [13:16:25] jynus: anything ongoing with db1084? [13:16:51] I cannot see anything wrong with that host [13:20:09] (03PS2) 10Gehel: shiny_server: Fix restart instructions [puppet] - 10https://gerrit.wikimedia.org/r/363486 (owner: 10Bearloga) [13:21:23] (03PS1) 10Marostegui: db-codfw.php: Depool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363594 (https://phabricator.wikimedia.org/T169510) [13:22:26] (03CR) 10Gehel: [C: 032] shiny_server: Fix restart instructions [puppet] - 10https://gerrit.wikimedia.org/r/363486 (owner: 10Bearloga) [13:23:24] jynus: there is an unmerged puppet patch that seems to be from you. Should I merge it with mine? 
[13:23:42] yes, please [13:23:47] (03CR) 10Marostegui: [C: 031] mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [13:24:00] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363594 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [13:24:08] jynus: done [13:25:18] I do not see any issue with db1084 [13:25:42] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363594 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [13:26:12] (03CR) 10jenkins-bot: db-codfw.php: Depool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363594 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [13:27:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2056 - T169510 (duration: 00m 44s) [13:27:11] in fact, I do not see any connection errors in the past hour [13:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:16] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510 [13:28:38] might be an issue on the jobrunners while connecting to db1084, weird [13:28:45] !log Stop MySQL on db2056 for maintenance - T169510 [13:28:50] no, no connection errors [13:28:52] the error message is not really descriptive :) [13:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [13:29:54] !log cp4013: upgrade to varnish 4.1.7-1wm1 and reboot for kernel update [13:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:27] jynus: I do see
/srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(995): Wikimedia\Rdbms\Database->reportConnectionError(string), this is why I thought it was upon connection to db1084 [13:30:35] no [13:30:41] that is a mediawiki exception [13:31:22] rephrase: I am not stating that it is a db issue, only writing to the chan to have others' opinions :) [13:31:50] it seems a code error [13:31:55] where close connection [13:32:00] tries to change the database [13:32:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [13:32:41] RecentChange::save has some issue [13:32:45] ah right the stacktrace starts with a destructor [13:33:26] they are not user-started queries, however [13:33:31] it is [13:33:40] /rpc/RunJobs.php?wiki=commonswiki&type=ChangeNotification&maxtime=60&maxmem=300M [13:34:01] maybe it is trying to reuse a closed connection or something [13:34:26] hashar: would you have time to open a task to investigate why Closure$RecentChange::save fails in this way? [13:35:37] the thing is- if connections would fail to open- we would see them on the stats for aborted/killed/etc db connections [13:36:03] makes sense [13:36:19] I didn't check carefully the stacktrace, it is clearer now [13:36:24] some logic using the load balancer could be wrong [13:36:38] like connections being closed but still reused [13:36:43] or something like that [13:36:55] PROBLEM - puppet last run on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:15] PROBLEM - confd service on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:35] PROBLEM - traffic-pool service on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:38:04] opening a task [13:38:35] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:35] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4013_v4, cp4013_v6 [13:38:36] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:36] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:45] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:45] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:45] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:45] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:45] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4013_v4, cp4013_v6 [13:38:45] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4013_v4, cp4013_v6 [13:38:55] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4013_v4, cp4013_v6 [13:38:55] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4013_v4, cp4013_v6 [13:38:55] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4013_v4, cp4013_v6 [13:39:25] PROBLEM - puppet last run on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:25] PROBLEM - traffic-pool service on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:25] PROBLEM - confd service on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:39:35] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 72 ESP OK [13:39:35] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [13:39:35] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 72 ESP OK [13:39:35] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 72 ESP OK [13:39:46] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 72 ESP OK [13:39:46] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 72 ESP OK [13:39:46] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 72 ESP OK [13:39:46] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 72 ESP OK [13:39:46] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 72 ESP OK [13:39:54] sorry for the spam wrt cp4013, it took longer to reboot than usual [13:39:55] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [13:39:55] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [13:39:55] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [13:39:56] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [13:40:39] hashar: https://phabricator.wikimedia.org/T169884 [13:40:45] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [13:40:51] XioNoX: any network issues in ulsfo? cp4007 and cp4006 seem to have flapped? 
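[editor's note] The hypothesis filed as T169884 above — a connection closed during destructor/shutdown ordering and then reused, surfacing as reportConnectionError — is easy to reproduce in miniature. A hedged sketch with sqlite3 standing in for MariaDB; MediaWiki's Wikimedia\Rdbms\LoadBalancer is PHP and its internals are not shown here:

```python
import sqlite3

# Minimal reproduction of the "reuse a closed connection" failure
# mode discussed in the log, with sqlite3 standing in for MariaDB.
# This only illustrates the pattern; it is not MediaWiki's code.

def use_after_close():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE recentchanges (rc_id INTEGER)")
    conn.close()  # e.g. closed early by destructor/shutdown ordering
    try:
        conn.execute("INSERT INTO recentchanges VALUES (1)")
    except sqlite3.ProgrammingError as exc:
        return str(exc)  # driver refuses the closed handle
    return None

if __name__ == "__main__":
    print(use_after_close())
```

The point matches the observation in the log: no aborted/killed connections show up server-side, because the failure happens entirely in the client when a dead handle is reused.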
[13:40:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [13:41:45] 10Operations, 10Wikimedia-General-or-Unknown: Jobrunners generate mediawiki exceptions upon calling Closure$RecentChange::save - https://phabricator.wikimedia.org/T169884#3411557 (10elukey) [13:41:52] (03PS1) 10Muehlenhoff: Move ferm service out of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/363595 [13:42:04] looking [13:42:46] (03PS2) 10ArielGlenn: use 'require_package' for stats packages including python-yaml [puppet] - 10https://gerrit.wikimedia.org/r/363382 [13:43:58] (03PS3) 10ArielGlenn: use 'require_package' for stats packages including python-yaml [puppet] - 10https://gerrit.wikimedia.org/r/363382 [13:44:18] Jul 6 13:33:34 asw-ulsfo mib2d[1464]: SNMP_TRAP_LINK_DOWN: ifIndex 529, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-1/0/3 [13:44:18] Jul 6 13:33:53 asw-ulsfo last message repeated 3 times [13:44:31] that's the interface to cp4013.ulsfo.wmnet [13:44:35] ema: ^ [13:45:04] XioNoX: ok, that's the machine I've rebooted and took longer than usual to bring up its network interface [13:45:25] PROBLEM - IPsec on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:41] XioNoX: anything related to cp4006/7? 
[13:45:53] that's the only thing out of the ordinary I can see for now [13:46:36] there are a lot of 50X now https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X [13:46:49] starting at 13:36 [13:46:55] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures [13:47:25] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures [13:47:29] elukey: indeed, all ulsfo [13:47:33] all uploadsss [13:47:35] RECOVERY - traffic-pool service on cp4006 is OK: OK - traffic-pool is active [13:47:35] RECOVERY - confd service on cp4006 is OK: OK - confd is active [13:47:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A couple of small things, but LGTM overall." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [13:48:07] elukey: right, cp4006/7 are upload hosts [13:48:15] (03CR) 10ArielGlenn: use 'require_package' for stats packages including python-yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363382 (owner: 10ArielGlenn) [13:48:25] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [13:48:35] RECOVERY - traffic-pool service on cp4007 is OK: OK - traffic-pool is active [13:48:45] RECOVERY - Check Varnish expiry mailbox lag on cp4013 is OK: OK: expiry mailbox lag is 377 [13:48:50] <_joe_> should we depool ulsfo? [13:49:22] _joe_: both hosts seem to be back to normal now, but we should prep a patch to depool in case it happens again [13:49:22] it seems decreasing now, probably 400[67] are recovering [13:49:59] (03PS1) 10Gehel: wdqs - send ldf traffic to wdqs1003.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/363596 (https://phabricator.wikimedia.org/T166244) [13:50:35] PROBLEM - puppet last run on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:52:02] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/363382 (owner: 10ArielGlenn) [13:52:05] PROBLEM - puppet last run on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:52:06] from https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=cp4006&var-network=eth0 something happened, doesn't seem to be network related so far, still investigating [13:52:24] the box seems overloaded, ssh is struggling to give me a shell [13:52:35] iowait is skyrocketed on cp4006 [13:52:59] load average: 58.23, 55.12, 41.27 [13:53:00] (03PS1) 10Ema: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/363597 [13:53:25] RECOVERY - confd service on cp4007 is OK: OK - confd is active [13:53:27] (03PS2) 10ArielGlenn: monitor dataset hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 [13:53:35] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 40 minutes ago with 0 failures [13:54:11] 5xx-wise we're good now [13:54:19] (03PS1) 10Elukey: Temporary depool ulsfo for network issues [dns] - 10https://gerrit.wikimedia.org/r/363598 [13:54:26] ah sorry didn't see ema's patch [13:55:16] (03PS1) 10Alexandros Kosiaris: Revert "ores: Use nutcracker for redis" [puppet] - 10https://gerrit.wikimedia.org/r/363600 [13:55:26] PROBLEM - IPsec on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:55:45] PROBLEM - confd service on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:55:46] PROBLEM - traffic-pool service on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
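The overload being chased here (iowait spiking, load average 58) is confirmed a few minutes later by running `iostat -x 2` by hand and eyeballing the %util column. A minimal sketch of that check, assuming sysstat's `iostat -x` layout with %util as the last column (the sample text below is illustrative, not real cp4006 output):

```python
def busy_devices(iostat_output, threshold=90.0):
    """List (device, %util) pairs above `threshold` from `iostat -x`
    output. Assumes sysstat's layout with %util as the last column;
    the avg-cpu values row (numeric first field) and header rows
    (non-numeric last field) are skipped."""
    busy = []
    for line in iostat_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            float(fields[0])   # avg-cpu values row: first field is numeric
            continue
        except ValueError:
            pass
        try:
            util = float(fields[-1])
        except ValueError:
            continue           # header rows: "%util" isn't a number
        if util > threshold:
            busy.append((fields[0], util))
    return busy

sample = """\
Device: rrqm/s wrqm/s r/s w/s %util
sda 0.00 1.20 10.00 20.00 99.20
sdc 0.00 0.10 1.00 2.00 12.00
"""
print(busy_devices(sample))  # [('sda', 99.2)]
```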
[13:56:08] (03PS2) 10Alexandros Kosiaris: Revert "ores: Use nutcracker for redis" [puppet] - 10https://gerrit.wikimedia.org/r/363600 [13:56:17] (03PS3) 10Alexandros Kosiaris: Revert "ores: Use nutcracker for redis" [puppet] - 10https://gerrit.wikimedia.org/r/363600 [13:56:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "ores: Use nutcracker for redis" [puppet] - 10https://gerrit.wikimedia.org/r/363600 (owner: 10Alexandros Kosiaris) [13:56:26] PROBLEM - confd service on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:56:35] PROBLEM - puppet last run on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:29] /dev/sda3 362G 361G 1.7G 100% /srv/sda3 [13:57:45] PROBLEM - traffic-pool service on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:58:24] ema: is it normal that /dev/sda3 is that busy? Maybe varnish doesn't like it much [13:58:45] 289G varnish.bin2 [13:58:50] this is on cp4007 [13:59:04] elukey: yeah that's fine [13:59:11] all right [13:59:13] !log rebooting restbase2001 for kernel update [13:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:45] PROBLEM - MD RAID on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:59:55] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 43 minutes ago with 0 failures [14:00:45] PROBLEM - Freshness of zerofetch successful run file on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:00:56] ema: can we try to restart varnish on one of the cps? [14:01:05] (03PS5) 10Ayounsi: Add diffscan module. 
[puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) [14:01:12] elukey: yeah I was about to say that we should do that on cp4006 :) [14:01:15] depooling it [14:01:18] super [14:01:34] it seems going awol, not sure what it is doing [14:01:45] RECOVERY - traffic-pool service on cp4007 is OK: OK - traffic-pool is active [14:01:59] elukey: can you please check the other upload-ulsfo hosts meanwhile? [14:02:12] !log depool cp4006 [14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:24] ema: sure, I am on it... shall I restart varnish on it too ? [14:02:25] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [14:02:35] RECOVERY - Freshness of zerofetch successful run file on cp4006 is OK: OK [14:02:35] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 49 minutes ago with 0 failures [14:02:35] RECOVERY - MD RAID on cp4006 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [14:02:35] RECOVERY - traffic-pool service on cp4006 is OK: OK - traffic-pool is active [14:02:35] RECOVERY - confd service on cp4006 is OK: OK - confd is active [14:02:37] (03PS2) 10Muehlenhoff: Move ferm service out of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/363595 [14:02:52] elukey: no restarts, just a general check [14:03:05] PROBLEM - puppet last run on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:03:12] okok [14:03:24] cp4006 depooled but it's still receiving traffic [14:03:44] lvs lost etcd? [14:03:49] probably [14:04:45] PROBLEM - traffic-pool service on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
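On the 100%-full `/srv/sda3` above, which ema confirms is fine: Varnish's file storage (the 289G `varnish.bin2`) is typically sized to occupy most of its partition, so a near-full df reading is the intended steady state rather than a leak. When in doubt whether a large file really occupies its apparent size, comparing `st_size` with the allocated block count distinguishes a truly allocated file from a sparse one. A minimal sketch, assuming a POSIX `stat`:

```python
import os

def disk_usage_of(path):
    """Return (apparent size, bytes actually allocated) for `path`.
    st_blocks is counted in 512-byte units per POSIX, regardless of
    the filesystem's block size; for a sparse file the second number
    is much smaller than the first."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```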
[14:05:05] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 49 minutes ago with 0 failures [14:05:10] yeah it's still in lvs4002's ipvsadm list [14:05:25] RECOVERY - confd service on cp4007 is OK: OK - confd is active [14:05:29] (03CR) 10Alexandros Kosiaris: [C: 031] Move ferm service out of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/363595 (owner: 10Muehlenhoff) [14:05:35] RECOVERY - traffic-pool service on cp4007 is OK: OK - traffic-pool is active [14:06:24] !log restart pybal on lvs4004 T169765 [14:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:35] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 [14:07:07] so sda and sdb on cp4007 are 99% busy (iostat -x 2) and there is a ton of iowait [14:08:01] no http connections from lvs4002/4004 to etcd, restarting pybal on lvs4002 too [14:08:13] !log restart pybal on lvs4002 T169765 [14:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:55] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:09:36] so now cp4006 is actually depooled [14:09:58] load average went down to 1 immediately [14:10:19] and cp4007 is recovering too [14:10:58] wth [14:11:33] I'm gonna check the other LVSs [14:12:46] lvs4003 has no connection to etcd [14:13:35] lvs4001 has two in SYN_SENT [14:13:42] tcp 0 1 lvs4001.ulsfo.wmn:47952 conf2003.codfw.wmn:2379 SYN_SENT - [14:13:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:13:53] tcp 0 1 lvs4001.ulsfo.wmn:47950 conf2003.codfw.wmn:2379 SYN_SENT - [14:14:41] they all *did* establish connections to conf2003 after the reboot [14:15:51] do they lost it when you saw what seemed to be a network hiccup?
[14:15:54] *did [14:16:03] !log upgrade labmon to grafana 4.4.1 - T169773 [14:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:15] T169773: Upgrade grafana to 4.4 - https://phabricator.wikimedia.org/T169773 [14:16:46] oh [14:17:04] maybe things started going south with the depooled reboot of cp4013 [14:18:00] ema: for cp4007 at ~13:30 there are warnings for "unused backend" etc.. [14:18:07] that is more or less when the iowait started [14:18:48] first spike of 503s was at 13:36 [14:19:23] elukey: 'unused backend' is normal [14:19:23] (03CR) 10Alexandros Kosiaris: [C: 031] role::puppetmaster::common: add environments support [puppet] - 10https://gerrit.wikimedia.org/r/362985 (owner: 10Giuseppe Lavagetto) [14:19:48] ema: sure sure, I wanted to give you some datapoints about "maybe things started going south with the depooled reboot of cp4013" [14:20:08] (03CR) 10Volans: [C: 04-1] "Beside my comment on the check itself, any reason to not apply it to all hosts that have NFS? Also see few minor things inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363548 (owner: 10ArielGlenn) [14:20:09] timings match with your suspicion [14:20:29] !log restart pybal on lvs4003 T169765 [14:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:40] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 [14:21:09] <_joe_> ema: uh I thought you looked into ulsfo pybals after the conf2* reboots [14:21:25] now on cp4007 iostat says ~30/40% of usage for sda,sdb, it was 99% during the spike [14:21:59] _joe_: I did, and they looked good [14:22:07] <_joe_> uhm, ok [14:22:57] _joe_: I haven't restarted pybal on lvs4001 yet in case you want to take a look [14:24:03] it keeps on establishing new connections to conf2003 [14:24:07] <_joe_> ema: this is *very* strange and looks like a defect in twisted httpclient library [14:24:21] <_joe_> ema: let's look at conf2003, connections are to nginx [14:24:31] <_joe_> so hopes are we can see what the hell is going on on that side [14:26:14] !log rebooting prometheus2003 for kernel update [14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:07] _joe_ could it be that 400[67] were the only ones serving traffic because the rest of ulsfo upload was somehow depooled? [14:27:29] <_joe_> elukey: pybal will tell you [14:27:56] netstat output for lvs4001 connections to conf2003 https://phabricator.wikimedia.org/P5689 [14:28:09] I've repeated the command a couple of times, with timestamps [14:28:20] <_joe_> ok I know what the problem is [14:28:25] <_joe_> restart pybal [14:28:36] <_joe_> and this is luckily something we can fix! 
[14:28:39] ok [14:28:49] <_joe_> {"errorCode":401,"message":"The event in requested index is outdated and cleared","cause":"the requested history has been cleared [73370/70449]","index":74369} [14:29:16] <_joe_> so restart all pybals in ulsfo [14:29:29] !log restart pybal on lvs4001 T169765 [14:29:32] Jul 6 13:30:42 cp4007 varnishd[28548]: Unused backend be_cp4021_ulsfo_wmnet,Unused backend be_cp4014_ulsfo_wmnet, Unused backend be_cp4013_ulsfo_wmnet, [14:29:37] _joe_: done, see SAL [14:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:39] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 [14:30:40] _joe_: can you translate etcd->english? Does that means that pybal was having a pointer to an event that was cleared because of the restart of conf2003? [14:30:59] so it couldn't reconnect the watcher since that event (or something similar) [14:31:03] (03PS2) 10Alexandros Kosiaris: Use $real_jobdefaults instead of $jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/363587 [14:31:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Use $real_jobdefaults instead of $jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/363587 (owner: 10Alexandros Kosiaris) [14:31:38] (03CR) 10DCausse: Switch this repo to a deb package (034 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [14:32:00] (03PS7) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [14:32:53] <_joe_> volans: basically, yes [14:33:11] ok and the fix I guess is to catch it and restart the watcher from "zero" [14:33:22] <_joe_> volans: yes again [14:33:31] \o/ [14:33:32] :D [14:33:47] <_joe_> volans: so basically this could happen at any time, say you don't change anything on a pool for a very, very long time [14:34:02] 
<_joe_> and then you need to reconnect for $reason [14:34:13] <_joe_> we need to reset the index on any reconnection [14:34:29] <_joe_> let me write a ticket [14:35:18] 10Operations, 10Pybal, 10Traffic: pybal should reset the etcdindex it's looking at after losin a connection - https://phabricator.wikimedia.org/T169893#3411770 (10Joe) [14:35:26] <_joe_> ouch, damn phabricator ui [14:35:46] ok, I'm missing etcd internals to know if reset it every time might re-apply old changes or not [14:35:50] but makes sense [14:37:06] thanks [14:39:15] !log repool cp4006 [14:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:49] !log rebooting prometheus2004 for kernel update [14:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:37] 10Operations, 10Pybal, 10Traffic: pybal should reset the etcdindex it's looking at after losing a connection - https://phabricator.wikimedia.org/T169893#3411787 (10Joe) [14:43:39] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3411791 (10GWicke) IRC meeting summary: https://tools.wmflabs.org/meetbot/wiki... [14:48:36] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (10MZMcBride) >>! In T167906#3411791, @GWicke wrote: > Since there is... 
[14:50:34] !log rebooting prometheus1003/1004 for kernel update [14:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:46] 10Operations, 10User-fgiunchedi: Upgrade grafana to 4.4.1 - https://phabricator.wikimedia.org/T169773#3411803 (10fgiunchedi) p:05Triage>03Normal [14:51:45] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:51:45] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:52:35] (03PS1) 10Giuseppe Lavagetto: Reset the waitIndex when connection is lost or failed [debs/pybal] - 10https://gerrit.wikimedia.org/r/363611 (https://phabricator.wikimedia.org/T169893) [14:52:45] <_joe_> ema: ^^ [14:54:45] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:55:25] (03CR) 10Ema: [C: 031] Reset the waitIndex when connection is lost or failed [debs/pybal] - 10https://gerrit.wikimedia.org/r/363611 (https://phabricator.wikimedia.org/T169893) (owner: 10Giuseppe Lavagetto) [14:58:05] <_joe_> ema: we should test this on our test system [14:58:16] <_joe_> it should be relatively easy to do [14:58:25] (03PS1) 10Alexandros Kosiaris: Use $real_jobdefaults instead of $jobdefaults #2 [puppet] - 10https://gerrit.wikimedia.org/r/363613 [14:58:36] (03CR) 10Alexandros Kosiaris: [C: 032] Use $real_jobdefaults instead of $jobdefaults #2 [puppet] - 10https://gerrit.wikimedia.org/r/363613 (owner: 10Alexandros Kosiaris) [14:58:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Use $real_jobdefaults instead of $jobdefaults #2 [puppet] - 10https://gerrit.wikimedia.org/r/363613 (owner: 10Alexandros Kosiaris) [14:58:48] _joe_: ok, pybal-test2001 is already running 1.13.7 [15:01:34] (03CR) 10Herron: [C: 032] Add logrotate template to retain 60 days of exim mx logs [puppet] - 
10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) (owner: 10Herron) [15:01:39] (03PS6) 10Herron: Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) [15:05:04] (03PS3) 10Paladox: Tweak ores::web config file user. [puppet] - 10https://gerrit.wikimedia.org/r/362097 (owner: 10Awight) [15:06:07] (03PS1) 10Alexandros Kosiaris: gerrit: Change the order profiles are applied [puppet] - 10https://gerrit.wikimedia.org/r/363617 [15:06:22] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] gerrit: Change the order profiles are applied [puppet] - 10https://gerrit.wikimedia.org/r/363617 (owner: 10Alexandros Kosiaris) [15:08:15] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:08:55] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [15:09:05] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational [15:10:41] 10Operations, 10Mail, 10Patch-For-Review: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3411940 (10herron) Merged https://gerrit.wikimedia.org/r/357723 and ran puppet on mx[1,2]001 to extend local log retention to 60 days. Will double check tomorro... [15:10:48] 10Operations, 10Mail: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3411943 (10herron) [15:12:20] !log extend mx[1,2]001 exim log retention to 60 days - T167333 [15:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:30] T167333: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333 [15:13:13] (03PS4) 10Awight: Tweak ores::web config file user. 
[puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) [15:13:40] (03CR) 10Alexandros Kosiaris: "That's almost ok, minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:14:29] (03CR) 10Alexandros Kosiaris: "s/the latter/the following/" [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:15:30] (03CR) 10Alexandros Kosiaris: Tweak ores::web config file user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:18:07] (03PS1) 10Muehlenhoff: Extend account data for jrbranaa [puppet] - 10https://gerrit.wikimedia.org/r/363621 [15:18:50] (03PS5) 10Paladox: Tweak ores::web config file user. [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:19:57] (03CR) 10Alexandros Kosiaris: "Code LGTM, all we need is a more descriptive commit message of what this patch does and I 'll merge." [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:19:59] (03CR) 10jerkins-bot: [V: 04-1] Tweak ores::web config file user. [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:21:15] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2025177 [15:21:58] (03PS6) 10Paladox: Tweak ores::web config file user. 
[puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:22:03] (03CR) 10Muehlenhoff: [C: 032] Extend account data for jrbranaa [puppet] - 10https://gerrit.wikimedia.org/r/363621 (owner: 10Muehlenhoff) [15:23:45] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:25:05] (03PS7) 10Awight: Conditionally switch ores::web config file user depending on context [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) [15:25:31] (03CR) 10Alexandros Kosiaris: [C: 032] Conditionally switch ores::web config file user depending on context [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:25:38] (03PS8) 10Alexandros Kosiaris: Conditionally switch ores::web config file user depending on context [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:25:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Conditionally switch ores::web config file user depending on context [puppet] - 10https://gerrit.wikimedia.org/r/362097 (https://phabricator.wikimedia.org/T169164) (owner: 10Awight) [15:27:02] (03PS1) 10Filippo Giunchedi: thumbor: parametrize poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/363626 [15:28:00] (03CR) 10jerkins-bot: [V: 04-1] thumbor: parametrize poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/363626 (owner: 10Filippo Giunchedi) [15:29:42] sad_trombone.mkv [15:29:45] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:32:25] (03PS2) 10Filippo Giunchedi: thumbor: parametrize poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/363626 [15:33:35] (03PS3) 10Filippo Giunchedi: thumbor: parametrize poolcounter [puppet] - 
10https://gerrit.wikimedia.org/r/363626 (https://phabricator.wikimedia.org/T169114) [15:33:58] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6944" [puppet] - 10https://gerrit.wikimedia.org/r/363626 (https://phabricator.wikimedia.org/T169114) (owner: 10Filippo Giunchedi) [15:34:45] (03PS5) 10EBernhardson: Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 [15:35:46] (03CR) 10Gilles: [C: 031] thumbor: parametrize poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/363626 (https://phabricator.wikimedia.org/T169114) (owner: 10Filippo Giunchedi) [15:35:57] (03CR) 10EBernhardson: Deploy mjolnir kafka daemon to relforge (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [15:36:47] (03CR) 10jerkins-bot: [V: 04-1] Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [15:38:06] (03PS6) 10EBernhardson: Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 [15:47:25] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3412015 (10Anomie) >>! In T167906#3411800, @MZMcBride wrote: >>>! In T167906#3... [15:50:19] (03CR) 10Andrew Bogott: "Is it safe to assume that there will never, ever be local changes to our clone of the repo?" 
[puppet] - 10https://gerrit.wikimedia.org/r/362928 (owner: 10BryanDavis) [15:51:15] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 4309 [15:56:09] (03PS2) 10Andrew Bogott: labsdb: Add babel table to public views [puppet] - 10https://gerrit.wikimedia.org/r/363492 (https://phabricator.wikimedia.org/T160713) (owner: 10BryanDavis) [15:57:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Note: please hold off on merging this until Ib1c574cd00 is deployed. It turned out some type checking queries take surprisingly long if yo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (https://phabricator.wikimedia.org/T168938) (owner: 10Lucas Werkmeister (WMDE)) [15:57:55] (03CR) 10Andrew Bogott: [C: 032] labsdb: Add babel table to public views [puppet] - 10https://gerrit.wikimedia.org/r/363492 (https://phabricator.wikimedia.org/T160713) (owner: 10BryanDavis) [15:58:44] (03CR) 10Andrew Bogott: "This probably needs me to run maintain-views in a bunch of places, I think..." [puppet] - 10https://gerrit.wikimedia.org/r/363492 (https://phabricator.wikimedia.org/T160713) (owner: 10BryanDavis) [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170706T1600). [16:00:04] RainbowSprinkles: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:01:28] RainbowSprinkles: looking at your patch [16:02:03] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4014.ulsfo.wmnet,service=varnish-be [16:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:48] RainbowSprinkles: ping? 
[16:06:10] (03PS4) 10Filippo Giunchedi: thumbor: parametrize poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/363626 (https://phabricator.wikimedia.org/T169114) [16:07:40] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: parametrize poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/363626 (https://phabricator.wikimedia.org/T169114) (owner: 10Filippo Giunchedi) [16:09:49] (03CR) 10Thcipriani: [C: 031] "It looks like you have what you need from the scap side, still needs a scap/scap.cfg in the search/MjoLniR repo." [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [16:12:56] !log bounce thumbor to apply https://gerrit.wikimedia.org/r/#/c/363626/ [16:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:01] (03CR) 10Gehel: [C: 031] "minus the scap/scap.cfg file in the mjolnir repo, this looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [16:24:38] godog: Pong, sorry (was in a meeting) [16:25:47] RainbowSprinkles: no worries, is it applied in beta already? https://gerrit.wikimedia.org/r/#/c/323867/ that is [16:26:12] We hadn't cherry-picked it explicitly, no [16:27:44] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3412158 (10GWicke) > IMO the proposal continues to be far too optimistic in as... 
[16:28:26] RainbowSprinkles: ok, I'll cherry pick it [16:29:57] The only gotcha I think was the etcd configuration, but I got that in the change [16:30:10] Assuming that module works as advertised ☺️ [16:30:55] seems to be the case, https://puppet-compiler.wmflabs.org/6948/ [16:31:33] RainbowSprinkles: ok so in beta it isn't going to work [16:31:34] Error: Could not set home on user[mwdeploy]: Execution of '/usr/sbin/usermod -d /var/lib/mwdeploy mwdeploy' returned 6: usermod: user 'mwdeploy' does not exist in /etc/passwd [16:31:55] meaning home needs to be changed in ldap, I don't know how [16:32:41] Blah, f'ing [16:32:43] I can do that [16:32:56] Gotta fix in LDAP, then puppet [16:33:06] I hit this recently with gerrit2 or something iirc [16:34:13] RainbowSprinkles: ack, fixing it now? [16:34:14] Ok home edited on ldap, puppet should apply cleanly now [16:34:30] I'll check [16:34:37] `ldapvi -b ou=people cn=foouser` is the magic invocation, fwiw :) [16:36:38] ah ye olde ldapvi, thanks [16:37:05] yup fine in beta, needs nscd -i passwd to flush the cache [16:37:33] "I want to edit ldap records using vim" said no one ever ;-) [16:38:08] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6948" [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [16:38:16] (03PS7) 10Filippo Giunchedi: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [16:40:36] (03CR) 10Filippo Giunchedi: [C: 032] Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [16:42:51] yep that's going to fail, the new home isn't created [16:44:25] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:44:25] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:44:45] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:44:45] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:44:45] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:44:56] yes yes [16:45:15] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:45:15] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:45:15] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:45:15] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:45:15] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:45:35] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:45:35] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:45:36] !log manually create mwdeploy's new home [16:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:55] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:15] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:25] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:25] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:35] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:36] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:45] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:45] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:45] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:46:45] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/mwdeploy/.etcdrc] [16:47:25] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:47:25] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:47:25] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:47:25] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:47:26] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:47:35] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:47:36] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:47:36] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:47:36] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:47:45] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:47:45] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:47:45] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 
failures [16:47:45] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:47:45] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:47:46] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:47:55] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:48:15] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:48:15] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:48:15] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:48:15] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:48:15] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:48:15] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:48:41] RainbowSprinkles: can you try a deploy to make sure everything still works? 
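The puppet failure storm above all traces to one resource, File[/var/lib/mwdeploy/.etcdrc], which could not be managed because mwdeploy's new home directory did not exist yet; the "!log manually create mwdeploy's new home" entry is the fix, after which every host recovers on its next run. A minimal sketch of that manual step is below; the 0755 mode is an assumption, and the path is built under a temp directory here so the sketch is safe to run (in production it was /var/lib/mwdeploy, per T86971):

```shell
# Sketch of "!log manually create mwdeploy's new home".
# Production path was /var/lib/mwdeploy (T86971); base dir and mode here
# are illustrative assumptions, not the actual command history.
base="${MWDEPLOY_BASE:-$(mktemp -d)}"
home="$base/mwdeploy"
install -d -m 0755 "$home"   # create the home directory (and parents) if missing
: > "$home/.etcdrc"          # empty placeholder; puppet manages the real contents
ls -ld "$home"
```

Once the directory exists, puppet's File resource can write .etcdrc itself, which matches the RECOVERY messages that follow within a couple of minutes.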
[16:48:51] Okie dokie [16:50:20] !log demon@tin Synchronized README: Testing testing 1 2 3 (duration: 00m 44s) [16:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:47] ACKNOWLEDGEMENT - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:4 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T169906 [16:55:51] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T169906#3412418 (10ops-monitoring-bot) [16:56:20] godog: no errors on my end [16:56:33] 10Operations, 10ops-codfw, 10DBA: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3412422 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [16:56:59] RainbowSprinkles: ack, thanks [16:57:44] Yay teamwork [16:58:36] (03CR) 10Thcipriani: [C: 031] Deploy statsv with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) (owner: 10Filippo Giunchedi) [16:58:37] Also, puppet now has a Happy Gilmore reference :p [16:58:40] 🙏 [16:58:52] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#3412439 (10demon) 05Open>03Resolved a:03demon [16:59:27] Hmm, we have a related task for l10nupdate user. I thought we fixed that already... [16:59:57] Nope still in /home [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170706T1700). 
[17:00:49] Nothing for ORES today [17:03:07] no parsoid deploy today [17:04:43] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T169906#3412418 (10Volans) Closing as duplicate of T169693 . The disk went to a failed state when removed and is now rebuilding. ``` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli Smart Array P420i in Slot 0... [17:04:56] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T169906#3412461 (10Volans) [17:04:58] 10Operations, 10ops-codfw, 10DBA: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3406026 (10Volans) [17:05:44] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3412471 (10Papaul) asw-d-codfw:ge-1/0/13 [17:06:26] (03CR) 10Thcipriani: Add 3d2png deploy repo to image scalers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [17:08:16] 10Operations, 10ops-codfw, 10DBA: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3412481 (10Marostegui) Thank you @Papaul : ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 17% complete) ``` [17:15:23] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3412521 (10Anomie) Yes, you've said that before. I have no idea how you plan t... [17:19:35] PROBLEM - Check Varnish expiry mailbox lag on cp4015 is CRITICAL: CRITICAL: expiry mailbox lag is 2016539 [17:26:34] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3412574 (10mpopov) >>! In T169175#3409501, @Gehel wrote: > @mpopov I love your graphs! 
They just look nice! Aw, thank you! :D > That being said, we probab... [17:28:11] (03CR) 10BryanDavis: "> Is it safe to assume that there will never, ever be local changes" [puppet] - 10https://gerrit.wikimedia.org/r/362928 (owner: 10BryanDavis) [17:35:37] (03CR) 10Herron: [C: 032] Remove exim4-heavy and exim4::ganglia from role requesttracker_server [puppet] - 10https://gerrit.wikimedia.org/r/363390 (https://phabricator.wikimedia.org/T169794) (owner: 10Herron) [17:35:44] (03PS2) 10Herron: Remove exim4-heavy and exim4::ganglia from role requesttracker_server [puppet] - 10https://gerrit.wikimedia.org/r/363390 (https://phabricator.wikimedia.org/T169794) [17:35:47] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3412608 (10RobH) a:05RobH>03Papaul I've gone ahead and setup the following: robh@asw-b-codfw# show | compare [edit interfaces interface-range vlan-la... [17:40:55] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3412637 (10Gehel) @BBlack / @ema we seem to have a good grasp on the "usual" maps traffic. I'll let you take over and see if we want to implement rate limit... [17:41:01] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3412639 (10RobH) Confirmed with @chasemp that the instances vlan is indeed where we want this. Once the OS install is done, assign back to me to enable th... 
[17:41:38] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3412640 (10Gehel) a:05mpopov>03ema [17:42:08] (03CR) 10BryanDavis: [C: 031] Move ferm service out of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/363595 (owner: 10Muehlenhoff) [17:44:21] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3412643 (10RobH) a:05RobH>03Papaul network port setup done [17:44:32] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3412645 (10RobH) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170706T1800). Please do the needful. [18:00:04] Niharika and Smalyshev: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:15] o/ [18:01:31] I can SWAT. [18:01:36] ebernhardson: Are you around? [18:02:17] (03PS1) 10Ppchelko: Deployment-Prep: Set correct restbase_uri for Change Propagation [puppet] - 10https://gerrit.wikimedia.org/r/363638 (https://phabricator.wikimedia.org/T169912) [18:02:26] Oh sorry, SMalyshev. 
[18:03:45] !log moved ununpentium to exim4-daemon-light - T169794 [18:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:55] T169794: Move profile::requesttracker::server to exim light variant - https://phabricator.wikimedia.org/T169794 [18:04:01] (03PS1) 10Reedy: Just run updateArticleCount.php over all.dblist [puppet] - 10https://gerrit.wikimedia.org/r/363639 [18:04:47] (03PS2) 10Niharika29: Add CodeMirror as a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) [18:04:53] (03CR) 10Niharika29: [C: 032] Add CodeMirror as a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [18:05:47] here [18:06:26] Niharika: sorry, was distracted a bit [18:06:54] SMalyshev: No worries. I'll get to your patch after the one above. [18:07:19] (03PS1) 10BryanDavis: Remove ukwikimedia from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363640 (https://phabricator.wikimedia.org/T169488) [18:08:10] (03Merged) 10jenkins-bot: Add CodeMirror as a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [18:08:21] (03CR) 10jenkins-bot: Add CodeMirror as a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [18:08:31] Niharika: yes [18:08:47] oh you got SMalyshev, that works too :) [18:09:36] RECOVERY - Check Varnish expiry mailbox lag on cp4015 is OK: OK: expiry mailbox lag is 6 [18:10:29] Niharika: I'd need it on terbium [18:10:36] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 [18:10:44] SMalyshev: Okay. 
[18:11:56] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Add CodeMirror as a beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/363497 (duration: 00m 43s) [18:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:08] SMalyshev: You mean you want me to pull it on terbium for you for testing? [18:13:03] Niharika: yes, and in general I'll be using it on terbium :) [18:13:24] though probably on wasat too, but will test on terbium [18:15:07] 10Operations, 10ops-codfw, 10DBA: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3412737 (10Marostegui) 05Open>03Resolved All good now: ``` physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK) ``` [18:16:04] Waiting on dear old Zuul. [18:20:04] be careful... it can be a long wait sometimes :-P [18:23:25] PROBLEM - HHVM rendering on mw2141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:15] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 74777 bytes in 0.294 second response time [18:26:51] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Jobrunners generate mediawiki exceptions upon calling Closure$RecentChange::save - https://phabricator.wikimedia.org/T169884#3412766 (10Aklapper) [18:27:32] (03PS2) 10Ppchelko: Deployment-Prep: Set correct restbase_uri for Change Propagation [puppet] - 10https://gerrit.wikimedia.org/r/363638 (https://phabricator.wikimedia.org/T169912) [18:30:06] zuul finally done :) [18:30:12] SMalyshev: Yeah, saw. [18:31:58] !log niharika29@tin Synchronized php-1.30.0-wmf.7/extensions/CirrusSearch/: Fix metastore.php notices https://gerrit.wikimedia.org/r/#/c/363637/ (duration: 00m 54s) [18:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:38] SMalyshev: Ugh, sorry, I synced them instead of pulling on terbium first. I was distracted. [18:32:44] Can you check if it works? [18:32:48] I hope I didn't break anything. 
[18:33:20] Niharika: I still see the old code on terbium in /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/maintenance/metastore.php ? [18:35:25] SMalyshev: Ah, hang on. [18:36:02] SMalyshev: Check now. [18:36:16] Niharika: works, thanks! [18:37:32] (03CR) 10C. Scott Ananian: [C: 031] OCG: Do not use the INFO command as a readiness check [puppet] - 10https://gerrit.wikimedia.org/r/363045 (owner: 10Mobrovac) [18:38:06] !log niharika29@tin Synchronized php-1.30.0-wmf.7/extensions/CirrusSearch/: Fix metastore.php notices https://gerrit.wikimedia.org/r/#/c/363637/ (duration: 00m 53s) [18:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:17] SMalyshev: Live everywhere now. [18:38:51] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3412793 (10RobH) [18:46:23] (03PS1) 10RobH: decommission of subra/suhail [dns] - 10https://gerrit.wikimedia.org/r/363648 [18:47:39] (03PS1) 10RobH: decom of subra/suhail [puppet] - 10https://gerrit.wikimedia.org/r/363649 [18:47:52] (03CR) 10RobH: [C: 032] decommission of subra/suhail [dns] - 10https://gerrit.wikimedia.org/r/363648 (owner: 10RobH) [18:48:15] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3412825 (10herron) [18:48:48] (03PS2) 10Chad: Remove ukwikimedia from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363640 (https://phabricator.wikimedia.org/T169488) (owner: 10BryanDavis) [18:49:33] (03CR) 10RobH: [C: 032] decom of subra/suhail [puppet] - 10https://gerrit.wikimedia.org/r/363649 (owner: 10RobH) [18:50:12] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3412834 (10RobH) [18:51:33] !log niharika29@tin Synchronized php-1.30.0-wmf.7/extensions/CodeMirror: (no justification provided) (duration: 00m 43s) [18:51:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:30] 10Operations, 10ops-codfw, 10hardware-requests: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3412840 (10RobH) a:05RobH>03Papaul @Papaul, Please go ahead and wipe the disks and decom/unrack these systems. Thanks! [18:54:42] (03PS2) 10Andrew Bogott: Labs: Update cdnjs clone commands [puppet] - 10https://gerrit.wikimedia.org/r/362928 (owner: 10BryanDavis) [18:58:13] James_F: I updated CodeMirror and deployed the beta feature patch today but I did not get to run a full scap so the codemirror-beta-title string hasn't updated. I hope you aren't mad at me. :P :) [18:58:28] Niharika: I'll cope. :-) [18:58:39] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3412852 (10mpopov) >>! In T169175#3412637, @Gehel wrote: > > As a very short summary of @mpopov's analyis: > > We would not limit anyone in the sample wit... [18:58:45] (03CR) 10Andrew Bogott: [C: 032] Labs: Update cdnjs clone commands [puppet] - 10https://gerrit.wikimedia.org/r/362928 (owner: 10BryanDavis) [18:58:50] 10Operations, 10Gerrit, 10Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3412853 (10greg) [18:58:52] Niharika: i18n update will auto-run in a bit anyway. [18:59:10] Ah, okay. 
[18:59:38] Well coincidentally, I want to run a full scap anyway [18:59:51] * RainbowSprinkles looks around, steals deploy conch [19:00:09] ALL HAIL RainbowSprinkles, HOLDER OF THE DEPLOY CONCH [19:00:27] (03CR) 10Chad: [C: 032] Remove ukwikimedia from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363640 (https://phabricator.wikimedia.org/T169488) (owner: 10BryanDavis) [19:02:06] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#3412889 (10greg) [19:02:16] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3389649 (10MaxSem) Note that currently our servers are [[ https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=maps1... [19:02:31] (03Merged) 10jenkins-bot: Remove ukwikimedia from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363640 (https://phabricator.wikimedia.org/T169488) (owner: 10BryanDavis) [19:02:42] (03CR) 10jenkins-bot: Remove ukwikimedia from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363640 (https://phabricator.wikimedia.org/T169488) (owner: 10BryanDavis) [19:05:10] !log demon@tin Started scap: Forcing l10n rebuild for James_F, plus some wmf-config cleanup [19:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:20] * James_F laughs. [19:05:49] Maybe we should actually have an IRC bot talking about who has the deploy conch? I know a bunch of places do that. [19:07:36] we tried one at one point... 
[19:07:51] it was buggy as I recall and we didn't bother to fix it [19:08:21] it was cool though in that it had a stacking queue you could put yourself in and then get a ping when it was your turn [19:08:45] 10Puppet, 10Release-Engineering-Team (Watching / External): Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607#3412958 (10greg) Adding @EBernhardson because I git blamed modules/mediawiki/manifests/mwrepl.pp and modules/mediawiki/manifests/init.pp :) Erik: Thoughts? [19:12:52] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3412983 (10debt) [19:13:02] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3412984 (10Gehel) I'm actually totally unsure of what conclusion we should have at this point, and that's why I'd like our friends from traffic to weigh in... [19:18:12] 10Puppet, 10Release-Engineering-Team (Watching / External): Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607#3413032 (10EBernhardson) mwrepl has a 'bypass access checks' option.
Just type: set bac on [19:22:32] !log demon@tin Finished scap: Forcing l10n rebuild for James_F, plus some wmf-config cleanup (duration: 17m 22s) [19:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:17] !log labstore2003 time bash restore.sh &> /tmp/restore_7_6_2017v1.log for T169774 [19:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:27] T169774: Toolforge data loss for permissive data July 2 2017 - https://phabricator.wikimedia.org/T169774 [19:29:01] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3413078 (10aaron) >>! In T164173#3343495, @aaron wrote: > @daniel , can you look into the amount of purges happening in... [19:34:17] 10Operations, 10cloud-services-team, 10Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393693 (10chasemp) >>! In T169290#3399875, @MoritzMuehlenhoff wrote: > Which NFS services/processes caused this? Summarizing from IRC for posterity :)... [20:15:59] 10Operations, 10ops-codfw: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3413261 (10RobH) a:03Papaul Confirmed with @mobrovac about this: Steps to depool: * sync up with @ssastry when we're offlining the host so services is aware * depool the host with confctl * remove from sca... [20:21:31] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3413280 (10Amire80) Hi, It seems to take a bit longer than usual... is there anything I can do to help? [20:33:34] 10Operations, 10ops-codfw: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3413305 (10ssastry) Okay, works for me.
[20:43:01] PROBLEM - salt-minion processes on labtestservices2002 is CRITICAL: Return code of 255 is out of bounds [20:44:01] RECOVERY - salt-minion processes on labtestservices2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:44:42] 10Operations, 10ops-codfw, 10Patch-For-Review: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3413338 (10mobrovac) The above patch removes `wtp2019` from the list of deployment target nodes. Before putting the node down, merge the patch and revert it after it is back online and po... [20:47:12] 10Operations, 10Analytics, 10Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3413368 (10Krinkle) [20:47:46] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3413381 (10mobrovac) [20:53:13] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3413393 (10Papaul) [20:54:04] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10Papaul) a:05Papaul>03chasemp @chasemp This is complete, you can take over. Thanks. [20:54:31] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3413398 (10Papaul) [20:54:46] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379794 (10Papaul) a:05Papaul>03chasemp @chasemp This is complete, you can take over. Thanks. 
[20:56:12] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3413401 (10Papaul) [20:56:36] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379811 (10Papaul) a:05Papaul>03chasemp @chasemp This is complete, you can take over. Thanks. [21:14:17] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 25 probes of 435 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [21:18:47] 10Operations, 10Services (done): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3413492 (10mobrovac) Really really nice to have this finally. Thank you @GWicke ! [21:19:17] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 435 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [21:22:15] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) databases deleted - https://phabricator.wikimedia.org/T169928#3413499 (10MarcoAurelio) [21:24:17] (03PS1) 10Paladox: Gerrit: Upgrade to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363728 [21:26:28] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3413528 (10MarcoAurelio) [21:29:03] another gerrit update? Please dear Lord no [21:29:55] TabbyCat: see -releng [21:29:58] ;) [21:30:06] TabbyCat: why not? [21:30:24] legoktm: last time my account got broke :) [21:30:35] oh [21:30:36] and in the past-past update a bunch of them as well [21:30:53] but nah, it's meant to be a bit humorous [21:31:16] ;) [21:32:00] Sagan: lol, as I expected, someone we all like is fearing [21:32:22] TabbyCat: yeah :) [21:35:19] It's for my testing. I will not be updating any gerrit. 
[21:35:25] (03Abandoned) 10Paladox: Gerrit: Upgrade to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363728 (owner: 10Paladox) [21:37:40] TabbyCat actually the next update we will do will fix this problem. Gerrit 2.13 reads from the index; 2.14.2 reads from the db. Also your problem was probably from when it first started in December before the problem was fixed. [21:40:07] another way to put it: 2.13.x did it wrong [21:40:13] 2.14.x goes back to doing it the original (right) way [21:47:36] (03PS1) 10Paladox: Gerrit: Upgrade gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [21:48:47] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:48:48] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3413600 (10madhuvishy) @jcrespo, okay, I'll do the announcements. @Halfak We are proposing labsdb1004 reboot (wikilabels db server) for Tuesday 11 July at 1400 UTC. Would that work for...
[21:50:07] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:51:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:51:28] !log ppchelko@tin Started deploy [changeprop/deploy@e1230e6]: Extend automatic blacklisting T169911 [21:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:40] T169911: RESTBase, Change-Prop and MobileApps got in a loop - https://phabricator.wikimedia.org/T169911 [21:51:47] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [21:52:37] !log ppchelko@tin Finished deploy [changeprop/deploy@e1230e6]: Extend automatic blacklisting T169911 (duration: 01m 09s) [21:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:56] upload 503s are up, the alert is not bogus [21:54:07] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [21:55:46] (03PS1) 10Chad: Fix .gitfat config, I was an idiot [software/gerrit] - 10https://gerrit.wikimedia.org/r/363736 [21:55:56] (03CR) 10Chad: [V: 032 C: 032] Fix .gitfat config, I was an idiot [software/gerrit] - 10https://gerrit.wikimedia.org/r/363736 (owner: 10Chad) [21:57:34] (03CR) 10Chad: Fix .gitfat config, I was an idiot (031 comment) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363736 (owner: 10Chad) [21:58:30] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3413650 (10Halfak) Yup! That works! 
[21:58:35] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3413651 (10Zppix) [22:03:10] (03Draft1) 10MarcoAurelio: Set $wgCategoryCollation to 'uca-default' for fr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) [22:03:28] (03PS2) 10MarcoAurelio: Set $wgCategoryCollation to 'uca-default' for fr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) [22:03:59] (03Draft1) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:04:01] (03PS2) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:04:54] (03CR) 10MarcoAurelio: "Requires running afterwards. Please see (03PS3) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:08:07] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [22:08:20] (03PS1) 10Jdlrobson: Wikivoyage projects can show more than 1 related article [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363742 (https://phabricator.wikimedia.org/T164765) [22:10:15] (03PS2) 10Jdlrobson: Wikivoyage projects can show more than 3 related articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363742 (https://phabricator.wikimedia.org/T164765) [22:12:31] (03CR) 10Bmansurov: [C: 031] Wikivoyage projects can show more than 3 related articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363742 (https://phabricator.wikimedia.org/T164765) (owner: 10Jdlrobson) [22:19:12] (03PS4) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:19:14] (03PS2) 10Paladox: Gerrit: Upgrading gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [22:20:57] 
PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:23:46] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#2045104 (10eross) @Jgreen Hi, do you have a list of who was subscribed to these aliases? [22:25:53] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3413753 (10aaron) a:05aaron>03None [22:26:56] (03PS1) 10MarcoAurelio: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) [22:27:58] (03CR) 10MarcoAurelio: "Requires running namespaceDupes.php afterwards." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) (owner: 10MarcoAurelio) [22:29:13] (03CR) 10Harej: [C: 031] "Thank you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) (owner: 10MarcoAurelio) [22:32:21] (03PS5) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:34:26] (03PS7) 10Volans: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) [22:34:28] (03PS11) 10Volans: Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [22:34:30] (03PS4) 10Volans: Tests: convert unittest to pytest [software/cumin] - 10https://gerrit.wikimedia.org/r/361274 (https://phabricator.wikimedia.org/T154588) [22:34:32] (03PS3) 10Volans: TODO: remove rejected item [software/cumin] - 10https://gerrit.wikimedia.org/r/361638 [22:34:34] (03PS1) 10Volans: Move configuration loader from cli to main 
module [software/cumin] - 10https://gerrit.wikimedia.org/r/363746 (https://phabricator.wikimedia.org/T169640) [22:34:36] (03PS1) 10Volans: Configuration: automatically load backend's aliases [software/cumin] - 10https://gerrit.wikimedia.org/r/363747 (https://phabricator.wikimedia.org/T169640) [22:34:38] (03PS1) 10Volans: Query and grammar: add support for aliases [software/cumin] - 10https://gerrit.wikimedia.org/r/363748 (https://phabricator.wikimedia.org/T169640) [22:34:40] (03PS1) 10Volans: QueryBuilder: fix subgroup close at the end of query [software/cumin] - 10https://gerrit.wikimedia.org/r/363749 [22:34:42] (03PS1) 10Volans: QueryBuilder: move query string to build() method [software/cumin] - 10https://gerrit.wikimedia.org/r/363750 [22:39:57] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:42:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:42:29] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3224448 (10Krinkle) ChangeNotificationJob https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/6cfd514ee9/cl... 
[22:42:49] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3413807 (10Krinkle) p:05Normal>03High [22:46:57] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:49:30] 10Operations, 10Epic, 10Goal, 10Services (later): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3413888 (10GWicke) [22:51:33] 10Operations, 10Puppet, 10Release-Engineering-Team (Watching / External): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066#3413911 (10greg) [22:52:40] 10Operations, 10Epic, 10Goal, 10Services (later): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3413928 (10GWicke) 05Open>03stalled [22:56:05] (03PS20) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [22:56:24] (03PS8) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170706T2300). [23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:12] I got it [23:04:21] (03CR) 10Krinkle: "Rebased and re-applied on Beta. 
Still passes on puppet compiler: https://puppet-compiler.wmflabs.org/6949/" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [23:04:54] jdlrobson: You about? [23:07:33] RainbowSprinkles: yup [23:07:44] Ok let's do this, should be quick [23:07:47] sorry lost track of time [23:07:51] (03CR) 10Chad: [C: 032] Wikivoyage projects can show more than 3 related articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363742 (https://phabricator.wikimedia.org/T164765) (owner: 10Jdlrobson) [23:07:57] +2'd both [23:10:03] sweet... ill wait for the sync [23:11:33] (03Merged) 10jenkins-bot: Wikivoyage projects can show more than 3 related articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363742 (https://phabricator.wikimedia.org/T164765) (owner: 10Jdlrobson) [23:11:44] (03CR) 10jenkins-bot: Wikivoyage projects can show more than 3 related articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363742 (https://phabricator.wikimedia.org/T164765) (owner: 10Jdlrobson) [23:15:53] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Wikivoyage projects can show more than 3 related articles (duration: 00m 43s) [23:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:07] jdlrobson: That's live ^ [23:18:09] Everywhere [23:21:34] RainbowSprinkles: w00ttt [23:21:39] RainbowSprinkles: looks good! [23:22:43] Still waiting on Jenkins for the MF one [23:23:10] sounds good [23:24:27] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:25:17] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:26:39] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3414108 (10RobH) 05Open>03Resolved [23:26:40] !log demon@tin Synchronized php-1.30.0-wmf.7/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: Only message box styles should be loaded on editor (duration: 00m 43s) [23:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:50] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3174562 (10RobH) removed all the descriptions from the disabled switch ports [23:28:02] jdlrobson: And the second one live now too ^^^ [23:28:16] RainbowSprinkles: looks good [23:28:19] Sweet [23:28:21] just have to wait and see if the logs calm down [23:28:26] but can be synced [23:28:59] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3414111 (10aaron) I also wonder why some of those log warnings come from close() and others have the proper commitMaste... [23:33:06] jdlrobson: Well it's already live everywhere :) [23:34:25] RainbowSprinkles: what's 1.30.0-alpha ? [23:34:35] Hm? [23:34:47] im seeing events on that mwversion [23:35:16] What wiki? 
I'm curious [23:35:23] (gimme the shorturl to logstash, if you've got it) [23:35:56] https://logstash.wikimedia.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-7d,mode:quick,to:now))&_a=(columns:!(_source),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'channel:%22resourceloader%22%20AND%20message:%22mobile.messageBox%22')),sort:!('@timestamp',desc)) [23:36:30] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3414158 (10Dzahn) @eross Hi, the thing is that problemsdonating@ is already on the Google side, but we have all these aliases that additionally point to that address. so problems.donati... [23:37:08] RainbowSprinkles: weird... it's not showing anymore [23:37:17] PROBLEM - HHVM rendering on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:37:19] showing as 1.30.0-wmf.7 now [23:37:51] RainbowSprinkles: oh wait.. i see what's happening. it's getting too late in the day and i was seeing that version on [23:37:51] https://logstash-beta.wmflabs.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-7d,mode:quick,to:now))&_a=(columns:!(_source),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'channel:%22resourceloader%22%20AND%20message:%22mobile.messageBox%22!'')),sort:!('@timestamp',desc)) :) [23:37:55] ie. beta cluster [23:37:58] hah that makes more sense [23:38:07] RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 74707 bytes in 0.331 second response time [23:38:08] Gotcha. 
Ok we're good then :) [23:40:12] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3414180 (10RobH) 05Open>03Resolved a:03RobH [23:40:17] (03Draft1) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [23:40:19] (03PS2) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [23:41:08] (03PS3) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [23:42:06] (03CR) 10jerkins-bot: [V: 04-1] WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [23:43:05] (03PS4) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [23:48:25] (03CR) 10Chad: [C: 04-1] "This will also need adding to scap::sources in deployment_server.yaml (see the examples where the repository has a different name)" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [23:49:04] (03CR) 10Paladox: WIP: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [23:49:38] (03CR) 10Paladox: WIP: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [23:53:45] (03PS5) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [23:55:53] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3414307 (10eross) @Dzahn OK, I just created the Google group with problems.donating@wikimedia.org, problemdonating@wikimedia.org, problem.donating@wikimedia.org and comentarios@wikimed... 
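[Editor's note on Chad's review comment about `scap::sources`: that hiera key tells the deployment server which repositories scap should check out, keyed by deploy directory name. A hedged sketch of the kind of entry he means, for a repo whose Gerrit name differs from its deploy name; the exact deploy name and repository path here are illustrative assumptions, not the merged change:]

```yaml
# hieradata/.../deployment_server.yaml (illustrative fragment)
scap::sources:
  gerrit/gerrit:                           # deploy directory on tin
    repository: operations/software/gerrit # Gerrit repo with a different name
```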
[23:58:27] (03CR) 10Chad: WIP: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [23:58:37] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3414310 (10Dzahn) >>! In T127488#3414307, @eross wrote: > pointing to problemsdonate@wikimedia.org . problemsdonating@ not problemsdonate@ right? > Just need to find who was subscr... [23:59:18] (03CR) 10Paladox: "Ah ok. How do i add an ssh key to gerrit2 to use through labs?" [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox)