[00:00:27] (03PS1) 10Ori.livneh: Switch over the 'sessions' ObjectCache to nutcracker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227620 (https://phabricator.wikimedia.org/T106986) [00:01:13] !log Switching over the sessions ObjectCache instance to use nutcracker. Users with an existing edit session in progress will have their session reset and will need to re-login. [00:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:01:42] (03CR) 10Ori.livneh: [C: 032] Switch over the 'sessions' ObjectCache to nutcracker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227620 (https://phabricator.wikimedia.org/T106986) (owner: 10Ori.livneh) [00:01:50] (03Merged) 10jenkins-bot: Switch over the 'sessions' ObjectCache to nutcracker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227620 (https://phabricator.wikimedia.org/T106986) (owner: 10Ori.livneh) [00:02:31] !log ori Synchronized wmf-config/CommonSettings.php: Iccd317c6: Switch over the 'sessions' ObjectCache to nutcracker (T106986) (duration: 00m 13s) [00:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:05:16] 6operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#1490281 (10BBlack) 3NEW a:3BBlack [00:05:35] 6operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#1490289 (10BBlack) p:5Triage>3Normal [00:07:20] (03PS1) 10Yuvipanda: admin: Add another key for myself [puppet] - 10https://gerrit.wikimedia.org/r/227621 [00:08:11] 6operations, 6Phabricator, 10VisualEditor: Unable to load https://phabricator.wikimedia.org/tag/visualeditor/ - https://phabricator.wikimedia.org/T107229#1490292 (10Christopher) It seems that https://phabricator.wikimedia.org/project/view/483/ loads, so this is sprint extension related. 
[00:09:41] (03PS4) 10BBlack: enable ipsec for all codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/219813 (https://phabricator.wikimedia.org/T81543) [00:09:51] (03CR) 10BBlack: [C: 032 V: 032] enable ipsec for all codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/219813 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [00:12:58] 6operations, 6Phabricator, 10VisualEditor: Unable to load https://phabricator.wikimedia.org/tag/visualeditor/ - https://phabricator.wikimedia.org/T107229#1490294 (10Jdforrester-WMF) > Related to {T107197} Duplicate of… [00:14:49] (03PS1) 10Tim Landscheidt: Labs: Subscribe self-hosted puppetmaster to hiera.yaml changes [puppet] - 10https://gerrit.wikimedia.org/r/227622 (https://phabricator.wikimedia.org/T107205) [00:23:00] (03CR) 10Tim Landscheidt: "I tested this on toolsbeta-puppetmaster3 to be a no-op for existing self-hosted puppet masters. I think it cannot be tested if it really " [puppet] - 10https://gerrit.wikimedia.org/r/227622 (https://phabricator.wikimedia.org/T107205) (owner: 10Tim Landscheidt) [00:24:11] Does anyone know how Varnish caching works for API modules? [00:24:22] (03PS1) 10BBlack: remove codfw cache::parsoid ipsec role (no tier2, breaks puppet) [puppet] - 10https://gerrit.wikimedia.org/r/227623 [00:24:33] 6operations, 6Phabricator, 10VisualEditor: Unable to load https://phabricator.wikimedia.org/tag/visualeditor/ - https://phabricator.wikimedia.org/T107229#1490309 (10Josve05a) >>! In T107229#1490294, @Jdforrester-WMF wrote: > Duplicate of… Feel free to merge. I wasn't sure if related. [00:24:39] (03CR) 10BBlack: [C: 032 V: 032] remove codfw cache::parsoid ipsec role (no tier2, breaks puppet) [puppet] - 10https://gerrit.wikimedia.org/r/227623 (owner: 10BBlack) [00:24:39] matt_flaschen: what specifically do you want to know? [00:25:12] ori, are GET requests cached and how long? 
[00:25:42] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1490312 (10GWicke) This will be significantly cleaner to configure with [request templates](https://github.com/wikimedia/restbase/pull/283). Once that feature is merged, we could repl... [00:27:28] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1484744 (10GWicke) [00:29:02] matt_flaschen: Requests with a session cookie are special-cased to bypass the cache, but otherwise varnish respects cache-control headers. [00:30:02] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1490328 (10RobH) a:5Tfinc>3RobH [00:30:17] 6operations, 6Phabricator, 10VisualEditor: Unable to load https://phabricator.wikimedia.org/tag/visualeditor/ - https://phabricator.wikimedia.org/T107229#1490329 (10Christopher) debug tick is also set for 15 seconds, meaning that a possible full load time window of 30 seconds is never reached. @mmodell? [00:31:51] matt_flaschen: the logic is entirely in mediawiki and extensions; see ApiMain::sendCacheHeaders / setCacheMaxAge / setCacheControl [00:32:15] ori, thanks, I'm also looking at getCacheMode. [00:32:56] ori: https://gerrit.wikimedia.org/r/#/c/227621 [00:33:17] (03PS2) 10Yuvipanda: admin: Add another key for myself [puppet] - 10https://gerrit.wikimedia.org/r/227621 [00:33:18] matt_flaschen: the EventLogging schema API module is designed to be cacheable for anons; you can run curl -I "https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=12518424" [00:33:22] I think there's going to be some unavoidable icinga spam coming up soon, for ipsec checks on cp20xx nodes. please ignore, will ack after they appear. 
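(Editorial note on the caching exchange above.) The rule described in the channel, that requests carrying a session cookie bypass the cache and everything else follows the response's Cache-Control header, can be sketched roughly as below. This is only an illustration and not the actual Varnish or MediaWiki logic: the function names are hypothetical, the test for what counts as a session cookie (any cookie whose name ends in "session") is an assumption, and header lookup is simplified to exact-case dictionary keys.

```python
def parse_cache_control(value):
    """Split a Cache-Control header into a {directive: value-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def has_session_cookie(cookie_header):
    """Assumption: any cookie whose name ends in 'session' is a session cookie."""
    for pair in cookie_header.split(";"):
        name = pair.partition("=")[0].strip()
        if name.lower().endswith("session"):
            return True
    return False


def shared_cache_ttl(request_headers, response_headers):
    """Return how many seconds a shared cache may keep the response (0 = do not cache)."""
    # Requests with a session cookie bypass the cache entirely.
    if has_session_cookie(request_headers.get("Cookie", "")):
        return 0
    cc = parse_cache_control(response_headers.get("Cache-Control", ""))
    # Directives that forbid storing or reusing the response in a shared cache.
    if "private" in cc or "no-store" in cc or "no-cache" in cc:
        return 0
    # s-maxage applies to shared caches and takes precedence over max-age.
    for key in ("s-maxage", "max-age"):
        if cc.get(key) is not None and cc[key].isdigit():
            return int(cc[key])
    return 0
```

Feeding it the headers printed by the `curl -I` command quoted above would give a rough idea of whether a given API response is eligible for caching for anonymous users.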
[00:33:26] (03CR) 10Yuvipanda: [C: 032 V: 032] admin: Add another key for myself [puppet] - 10https://gerrit.wikimedia.org/r/227621 (owner: 10Yuvipanda) [00:34:48] ori, thanks, comparing that to the request I'm interested in, it looks like the other one will not be cached. [00:35:42] ACKNOWLEDGEMENT - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:42] ACKNOWLEDGEMENT - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (not-connected: 44) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:42] ACKNOWLEDGEMENT - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 16 problems (not-connected: 16) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:42] ACKNOWLEDGEMENT - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:42] ACKNOWLEDGEMENT - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:43] ACKNOWLEDGEMENT - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (not-connected: 44) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:43] ACKNOWLEDGEMENT - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:44] ACKNOWLEDGEMENT - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (not-connected: 44) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:44] ACKNOWLEDGEMENT - IPsec on cp2015 is CRITICAL: 
Strongswan CRITICAL - 0 ESP transports installed, 16 problems (not-connected: 16) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:45] ACKNOWLEDGEMENT - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:45] ACKNOWLEDGEMENT - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (not-connected: 44) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:46] ACKNOWLEDGEMENT - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 16 problems (not-connected: 16) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:46] ACKNOWLEDGEMENT - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:35:47] ACKNOWLEDGEMENT - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 16 problems (not-connected: 16) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:36:17] heh I managed to ack before 3/3, so I guess the acks themselves are the only spam [00:38:56] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#1490357 (10Krenair) [00:40:13] ACKNOWLEDGEMENT - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:40:13] ACKNOWLEDGEMENT - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (attempting-to-connect: 44) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:40:13] ACKNOWLEDGEMENT - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - 0 ESP 
transports installed, 16 problems (attempting-to-connect: 16) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:40:13] ACKNOWLEDGEMENT - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 16 problems (attempting-to-connect: 16) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:40:13] ACKNOWLEDGEMENT - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (attempting-to-connect: 44) Brandon Black ipsec not fully deployed yet, valid but uninteresting [00:41:58] ACKNOWLEDGEMENT - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (attempting-to-connect: 44) Brandon Black ipsec not fully deployed yet, valid but [00:41:58] ACKNOWLEDGEMENT - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 16 problems (not-connected: 16) Brandon Black ipsec not fully deployed yet, valid but [00:41:58] ACKNOWLEDGEMENT - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 44 problems (attempting-to-connect: 44) Brandon Black ipsec not fully deployed yet, valid but [00:43:51] !log ori Synchronized php-1.26wmf15/extensions/AbuseFilter: Revert "Revert "Conversion to using getMainStashInstance()"" (duration: 00m 12s) [00:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:47:20] (03PS3) 10BBlack: enable ipsec for half eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219816 (https://phabricator.wikimedia.org/T81543) [00:48:52] (03PS3) 10BBlack: enable ipsec for all eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219817 (https://phabricator.wikimedia.org/T81543) [00:50:17] (03CR) 10BBlack: [C: 032] Fix typo in reverse DNS for ms-fe2003.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/227474 (owner: 10Alex Monk) [00:51:22] 6operations, 6Phabricator, 10VisualEditor: Unable to load 
https://phabricator.wikimedia.org/tag/visualeditor/ - https://phabricator.wikimedia.org/T107229#1490375 (10mmodell) @christopher I turned the debug time back to 0 [00:52:49] (03CR) 10Krinkle: "*bump*" [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [00:59:32] (03CR) 10BBlack: [C: 032] enable ipsec for half eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219816 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [01:02:06] 6operations, 6Phabricator, 10VisualEditor: Unable to load https://phabricator.wikimedia.org/tag/visualeditor/ - https://phabricator.wikimedia.org/T107229#1490391 (10Jdforrester-WMF) [01:08:56] ACKNOWLEDGEMENT - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (attempting-to-connect: 42) Brandon Black ipsec not fully deployed yet [01:08:56] ACKNOWLEDGEMENT - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (attempting-to-connect: 42) Brandon Black ipsec not fully deployed yet [01:08:56] ACKNOWLEDGEMENT - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (attempting-to-connect: 42) Brandon Black ipsec not fully deployed yet [01:10:33] (03PS1) 10Gergő Tisza: Add configuration for authmetrics logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227630 (https://phabricator.wikimedia.org/T91701) [01:11:06] (03CR) 10BBlack: [C: 032] enable ipsec for all eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219817 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [01:16:41] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [01:16:52] (03PS19) 10Gergő Tisza: [WIP] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [01:17:14] 6operations, 5Patch-For-Review, 5WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis 
connection failures - https://phabricator.wikimedia.org/T106986#1490413 (10ori) I configured MediaWiki to use Nutcracker for connections to the sessions redis cluster a little over an hour ago. We have not had... [01:17:46] 6operations, 5Patch-For-Review, 5WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1490414 (10ori) 5Open>3Resolved a:3ori [01:22:02] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) [01:23:32] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) [01:23:32] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) [01:23:42] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) [01:26:46] 6operations, 5Interdatacenter-IPsec: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1490419 (10BBlack) Status update: - ipsec puppet role is running on - codfw: all multi-tier cache clusters (text, mobile, bits, upload) - eqiad: only the text cluster - esams: only cp3030 (text cluster)... 
[01:27:04] 6operations, 5Interdatacenter-IPsec: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1490422 (10BBlack) [01:29:02] ACKNOWLEDGEMENT - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet [01:29:02] ACKNOWLEDGEMENT - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet [01:29:02] ACKNOWLEDGEMENT - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet [01:29:02] ACKNOWLEDGEMENT - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - 2 ESP transports installed, 42 problems (not-connected: 42) Brandon Black ipsec not fully deployed yet [01:41:14] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1490448 (10BBlack) [01:41:15] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1490449 (10BBlack) [01:41:17] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1490450 (10BBlack) [01:41:19] 6operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#1490447 (10BBlack) [01:41:31] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1451751 (10BBlack) [01:41:32] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1490451 (10BBlack) [02:03:04] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-29 02:03:03+00:00 [02:03:04] !log LocalisationUpdate failed (1.26wmf16) at 2015-07-29 02:03:04+00:00 [02:03:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:07:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 29 02:07:17 UTC 2015 (duration 7m 16s) [02:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:20:56] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1490489 (10Krinkle) 3NEW [02:33:36] (03PS1) 10Springle: reduce tendril memory footprint due to OOM, and switch to /srv [puppet] - 10https://gerrit.wikimedia.org/r/227641 [02:34:25] (03CR) 10Springle: [C: 032] reduce tendril memory footprint due to OOM, and switch to /srv [puppet] - 10https://gerrit.wikimedia.org/r/227641 (owner: 10Springle) [02:37:11] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 10m 08s) [02:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:27] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-29 02:43:27+00:00 [02:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:42] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:42] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:12] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:42] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1490536 (10ori) Approved. 
[02:51:06] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#723098 (10MZMcBride) It'd be neat if the index page for stream.wikimedia.org could pull from a protected wiki page on Meta-Wiki (similar t... [02:52:57] 6operations, 6Services, 10Traffic: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1490540 (10MZMcBride) >>! In T107086#1486516, @BBlack wrote: > Could we simply source it from a page on meta-wiki? (as in, rewrite the request internally to pull from meta-wiki?) Or something of th... [02:59:03] (03PS20) 10Gergő Tisza: [WIP] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [03:09:50] !log l10nupdate Synchronized php-1.26wmf16/cache/l10n: (no message) (duration: 10m 47s) [03:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:14:22] (03PS21) 10Gergő Tisza: [WIP] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [03:15:56] !log LocalisationUpdate completed (1.26wmf16) at 2015-07-29 03:15:56+00:00 [03:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:25:43] !log upgrade reboot db1011 trusty [03:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:31:41] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 43 failures [03:41:32] 6operations, 7Monitoring: Migrate monitoring alerts from watchmouse to catchpoint - https://phabricator.wikimedia.org/T107092#1490581 (10RobH) we got an alert this evening at 19:50 PDT, and then the clear came through without issues. 
[03:49:04] 6operations, 6Community-Advocacy, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1490591 (10Reedy) https://gerrit.wikimedia.org/r/#/c/227172/ wants updating to not remove the pa.us rewrites, but then needs merging... [03:49:24] (03CR) 10Reedy: [C: 04-1] "pa.us lines need to stay for now..." [puppet] - 10https://gerrit.wikimedia.org/r/227172 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [03:55:32] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [03:58:53] (03CR) 10Chad: [C: 032] Add special wikipedias to wikipedia.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227351 (owner: 10Alex Monk) [03:59:03] Krenair: Doing ^ [03:59:16] (03Merged) 10jenkins-bot: Add special wikipedias to wikipedia.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227351 (owner: 10Alex Monk) [04:00:18] !log demon Synchronized wmf-config/InitialiseSettings.php: moving special wikipedias to wikipedia.dblist (duration: 00m 12s) [04:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:00:58] !log demon Synchronized database lists: moving special wikipedias to wikipedia.dblist (duration: 00m 13s) [04:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:04:18] (03PS2) 10Chad: Remove multiple subdomain wiki rewrites [puppet] - 10https://gerrit.wikimedia.org/r/227172 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [04:04:26] Reedy: Rebased + removed the pa.us.wm one :) [04:08:13] (03CR) 10Yuvipanda: "And once that's done we can start running the other backups on a once-every-other-day schedule, I believe. 
Or even better - I think tools " [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [04:19:57] (03CR) 10Glaisher: [C: 031] varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [04:22:11] (03PS6) 10Krinkle: varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 [04:25:48] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#1490633 (10Glaisher) [04:29:37] (03CR) 10Krinkle: "There's a bunch of mentions in @wikimedia Git in at least two prominent places." [puppet] - 10https://gerrit.wikimedia.org/r/227172 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [04:36:41] Krenair: Is it not a problem that wikis are now both in 'wikipedia' and 'special' dblist at the same time? [04:36:46] I imagine that could confuse things [04:36:54] not sure what sitematrix and labs/meta_p will do [04:38:26] ostriches: ^ [04:38:27] Krenair: you updated that patch, right? I'll take a look tomorrow [04:39:12] Hrm. [04:40:22] Krinkle: SiteMatrix seems to Do The Right Thing and keeps them in special. 
[04:40:57] Good question re: labs tho :\ [04:41:05] * ostriches shall revert for now [04:41:50] (03PS1) 10Chad: Revert "Add special wikipedias to wikipedia.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227645 [04:41:58] (03CR) 10Chad: [C: 032 V: 032] Revert "Add special wikipedias to wikipedia.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227645 (owner: 10Chad) [04:42:41] !log demon Synchronized database lists: rv myself (duration: 00m 12s) [04:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:43:07] !log demon Synchronized wmf-config/InitialiseSettings.php: rv myself (duration: 00m 13s) [04:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:59:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 29097 seconds ago, expected 28800 [05:04:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 29397 seconds ago, expected 28800 [05:09:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 29697 seconds ago, expected 28800 [05:14:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 29997 seconds ago, expected 28800 [05:19:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 30297 seconds ago, expected 28800 [05:24:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 30597 seconds ago, expected 28800 [05:29:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 30897 seconds ago, expected 28800 [05:34:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 31197 seconds ago, expected 28800 [05:39:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 31498 seconds ago, expected 28800 [05:44:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 31798 seconds ago, expected 28800 [05:45:38] Frack I guess? 
[05:49:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 32097 seconds ago, expected 28800 [05:54:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 32397 seconds ago, expected 28800 [05:59:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 32697 seconds ago, expected 28800 [06:04:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 32997 seconds ago, expected 28800 [06:09:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 33297 seconds ago, expected 28800 [06:14:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 33597 seconds ago, expected 28800 [06:19:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 33897 seconds ago, expected 28800 [06:24:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 34197 seconds ago, expected 28800 [06:29:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 34497 seconds ago, expected 28800 [06:31:23] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 2 failures [06:31:52] PROBLEM - puppet last run on wtp2017 is CRITICAL Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures [06:33:02] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures [06:33:31] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:33:41] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures [06:34:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 34798 seconds ago, expected 28800 [06:39:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 35097 seconds ago, expected 28800 [06:44:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 35397 seconds ago, expected 28800 [06:49:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 35697 seconds ago, expected 28800 [06:54:21] PROBLEM - 
check_puppetrun on rigel is CRITICAL Puppet last ran 35997 seconds ago, expected 28800 [06:55:52] RECOVERY - puppet last run on wtp2017 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:56:53] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:57:21] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:22] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:32] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:52] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 36298 seconds ago, expected 28800 [07:04:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 36597 seconds ago, expected 28800 [07:09:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 36897 seconds ago, expected 28800 [07:14:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 37197 seconds ago, expected 28800 [07:19:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 37497 seconds ago, expected 28800 [07:24:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 37797 seconds ago, expected 28800 [07:29:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 38097 seconds ago, expected 28800 [07:34:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 38397 seconds ago, expected 28800 [07:39:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 38697 seconds ago, expected 28800 [07:41:54] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 
29 07:41:54 UTC 2015 (duration 41m 53s) [07:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:44:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 38997 seconds ago, expected 28800 [07:49:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 39298 seconds ago, expected 28800 [07:54:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 39598 seconds ago, expected 28800 [07:59:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 39897 seconds ago, expected 28800 [08:02:25] !log disabled puppet on labnodepool1001.eqiad.wmnet [08:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:04:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 40197 seconds ago, expected 28800 [08:09:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 40497 seconds ago, expected 28800 [08:14:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 40797 seconds ago, expected 28800 [08:16:32] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1490849 (10hashar) Before rebuilding the system, I wanted to make sure all .deb package dependencies are on apt.wikimedia.org. The Nodepool requirements.txt file lists the python mo... 
[08:19:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 41097 seconds ago, expected 28800 [08:20:31] good morning folks [08:21:32] PROBLEM - puppet last run on stat1002 is CRITICAL Puppet last ran 6 hours ago [08:22:33] PROBLEM - DPKG on labnodepool1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:24:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 41397 seconds ago, expected 28800 [08:28:32] RECOVERY - DPKG on labnodepool1001 is OK: All packages OK [08:28:41] PROBLEM - puppet last run on ganeti2003 is CRITICAL puppet fail [08:29:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 41697 seconds ago, expected 28800 [08:29:41] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1490853 (10hashar) [08:32:14] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1490854 (10hashar) In operations/debs/nodepool.git ``` $ git diff debian..0.1.0 requirements.txt ... 
-python-novaclient +python-novaclient>=2.21.0 $ ``` [08:34:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 41997 seconds ago, expected 28800 [08:39:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 42297 seconds ago, expected 28800 [08:44:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 42597 seconds ago, expected 28800 [08:48:55] (03PS1) 10Hashar: nodepool: use python-novaclient from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/227663 (https://phabricator.wikimedia.org/T104971) [08:49:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 42897 seconds ago, expected 28800 [08:54:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 43197 seconds ago, expected 28800 [08:54:43] RECOVERY - puppet last run on ganeti2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:59:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 43497 seconds ago, expected 28800 [09:04:09] (03CR) 10GoldenRing: "What is the `--venv` option to uwsgi? It's not mentioned in `uwsgi --help`." [puppet] - 10https://gerrit.wikimedia.org/r/227503 (https://phabricator.wikimedia.org/T104374) (owner: 10Yuvipanda) [09:04:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 43797 seconds ago, expected 28800 [09:09:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 44097 seconds ago, expected 28800 [09:10:03] <_joe_> what is this check_puppetrun that changes every few minutes? 
[09:12:35] (03PS1) 10Hashar: Merge tag '0.1.0' into debian [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/227664 [09:12:37] (03PS1) 10Hashar: Bump Debian package to 0.1.0-wmf1 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/227665 (https://phabricator.wikimedia.org/T104971) [09:14:14] <_joe_> !log depooling mw1159-60 from the imagescalers pool [09:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:14:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 44397 seconds ago, expected 28800 [09:14:46] any bright mind around? I am building a package that requires a package in jessie-backports and I can't figure out how to inject it in cowbuilder for jessie-wikimedia [09:15:12] <_joe_> hashar: uhm we might need to add jessie-backports to the config [09:15:23] <_joe_> I have no time right now though, sorry [09:15:28] or jessie-backports-wikimedia :-D [09:16:18] _joe_: no worries, I am RTFM at https://wiki.debian.org/BuildingFormalBackports [09:19:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 44697 seconds ago, expected 28800 [09:24:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 44997 seconds ago, expected 28800 [09:29:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 45297 seconds ago, expected 28800 [09:34:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 45597 seconds ago, expected 28800 [09:39:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 45898 seconds ago, expected 28800 [09:41:22] 6operations, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1491005 (10jcrespo) 3NEW [09:43:31] 6operations, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1491014 (10jcrespo) This basically does a full table scan. 
The query should either be banned or paged by an index: ``` MariaDB PRODUCTION s4 localhost... [09:44:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 46197 seconds ago, expected 28800 [09:49:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 46497 seconds ago, expected 28800 [09:50:26] 6operations, 5Continuous-Integration-Isolation, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1491022 (10hashar) Build the package and put it at: https://people.wikimedia.org/~hashar/debs/nodepool_0.1.0-wmf1/ terbium.eqiad.wmnet:/home/hashar... [09:51:23] 6operations, 5Continuous-Integration-Isolation, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1491023 (10hashar) a:3hashar [09:51:30] (03PS3) 10Filippo Giunchedi: ganglia: cleanup old temporary graphs [puppet] - 10https://gerrit.wikimedia.org/r/226087 (https://phabricator.wikimedia.org/T97637) [09:51:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ganglia: cleanup old temporary graphs [puppet] - 10https://gerrit.wikimedia.org/r/226087 (https://phabricator.wikimedia.org/T97637) (owner: 10Filippo Giunchedi) [09:54:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 46797 seconds ago, expected 28800 [09:55:07] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - https://phabricator.wikimedia.org/T107072#1491029 (10jcrespo) I can confirm that this is no longer an s3 only problem. It is affecting s1 and s4, too. [09:55:59] 6operations, 5Continuous-Integration-Isolation, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1491030 (10hashar) ``` root@labnodepool1001:/root# dpkg -i nodepool_0.1.0-wmf1_amd64.deb (Reading database ... 52061 files and directories currently i... 
[09:57:42] (03CR) 10Hashar: [C: 031] "For now I am manually installing the package with:" [puppet] - 10https://gerrit.wikimedia.org/r/227663 (https://phabricator.wikimedia.org/T104971) (owner: 10Hashar) [09:58:00] (03CR) 10Hashar: [C: 032 V: 032] Merge tag '0.1.0' into debian [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/227664 (owner: 10Hashar) [09:58:10] (03CR) 10Hashar: [C: 032 V: 032] Bump Debian package to 0.1.0-wmf1 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/227665 (https://phabricator.wikimedia.org/T104971) (owner: 10Hashar) [09:58:51] PROBLEM - puppet last run on uranium is CRITICAL puppet fail [09:59:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 47097 seconds ago, expected 28800 [09:59:56] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 5 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1491042 (10Joe) I think I found what is the latest gotcha with imagescalers: ``` curl -H 'X-Forwarded-Proto: https' -H 'Host: commons.wikimedia.or... 
[10:01:37] (03PS1) 10Filippo Giunchedi: ganglia: use recursion to tidy /tmp [puppet] - 10https://gerrit.wikimedia.org/r/227670 (https://phabricator.wikimedia.org/T97637) [10:01:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ganglia: use recursion to tidy /tmp [puppet] - 10https://gerrit.wikimedia.org/r/227670 (https://phabricator.wikimedia.org/T97637) (owner: 10Filippo Giunchedi) [10:04:17] (03CR) 10Hashar: [C: 04-1] "Seems we will want to install python-openstackclient from backports as well or:" [puppet] - 10https://gerrit.wikimedia.org/r/227663 (https://phabricator.wikimedia.org/T104971) (owner: 10Hashar) [10:04:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 47397 seconds ago, expected 28800 [10:04:32] 6operations, 5Patch-For-Review: stray ganglia-graph files left in /tmp - https://phabricator.wikimedia.org/T97637#1491053 (10fgiunchedi) 5Open>3Resolved tidy resource will clean up graphs older than a week [10:05:01] RECOVERY - puppet last run on uranium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:07:33] (03PS2) 10Hashar: nodepool: use OpenStack modules from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/227663 (https://phabricator.wikimedia.org/T104971) [10:09:16] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation: Acquire old production API servers for use in CI - https://phabricator.wikimedia.org/T84940#1491057 (10hashar) 5Open>3declined a:3hashar We are using labs infrastructure for now. There is no plan to reuse the old... 
[10:09:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 47697 seconds ago, expected 28800 [10:14:16] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool Debian package to - https://phabricator.wikimedia.org/T107266#1491065 (10hashar) 3NEW a:3hashar [10:14:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 47997 seconds ago, expected 28800 [10:14:29] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491076 (10hashar) [10:14:44] 6operations, 5Continuous-Integration-Isolation, 7Nodepool, 5Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1433466 (10hashar) [10:14:47] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491065 (10hashar) [10:18:05] <_joe_> !log repooling the zend imagescalers until https://gerrit.wikimedia.org/r/#/c/227676 is reviewed and deployed [10:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 48298 seconds ago, expected 28800 [10:22:18] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1491088 (10Joe) @cmjohnson let me know when this is done, so that I can proceed to install the servers. 
[10:24:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 48597 seconds ago, expected 28800 [10:29:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 48897 seconds ago, expected 28800 [10:31:42] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491090 (10hashar) 3NEW [10:31:55] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (10hashar) [10:31:58] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491100 (10hashar) [10:32:12] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 27 failures [10:33:54] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491090 (10hashar) [10:33:59] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1491115 (10hashar) [10:34:19] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491118 (10hashar) [10:34:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 49197 seconds ago, expected 28800 [10:34:22] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (10hashar) [10:34:30] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - 
https://phabricator.wikimedia.org/T107266#1491065 (10hashar) [10:34:46] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1491065 (10hashar) [10:39:18] (03PS2) 10Hashar: nodepool: stop using diskimage [puppet] - 10https://gerrit.wikimedia.org/r/227461 (https://phabricator.wikimedia.org/T102281) [10:39:22] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 49497 seconds ago, expected 28800 [10:39:43] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1491147 (10hashar) a:3hashar [10:40:22] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1361446 (10hashar) https://gerrit.wikimedia.org/r/227461 causes Nodepool to no longer rely on disk image builder. We will provid... 
[10:42:34] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491156 (10Joe) 3NEW [10:42:52] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491163 (10Joe) p:5Triage>3High [10:44:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 49797 seconds ago, expected 28800 [10:49:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 50097 seconds ago, expected 28800 [10:51:48] <_joe_> !log restarted apertium-apy on sca1001, freed 54 GB of RAM (processes were OOMing) [10:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:13] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491198 (10Joe) p:5High>3Unbreak! [10:53:46] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491156 (10Joe) I raised the priority as sca1001 was swapping to OOM death since forever, probably. I suggest we add some protec... 
[10:54:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 50397 seconds ago, expected 28800 [10:58:01] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 50697 seconds ago, expected 28800 [11:03:50] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491215 (10Joe) The pretty amazing effect of restarting apertium on sca1001 {F282719} [11:04:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 50997 seconds ago, expected 28800 [11:06:25] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491221 (10Joe) [11:09:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 51297 seconds ago, expected 28800 [11:14:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 51597 seconds ago, expected 28800 [11:19:21] PROBLEM - check_puppetrun on rigel is CRITICAL Puppet last ran 51897 seconds ago, expected 28800 [11:24:21] RECOVERY - check_puppetrun on rigel is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:29:00] _joe_: thanks for T107270 [11:29:13] I was about to ask for a restart. 
[11:29:20] (03PS5) 10Hashar: Support spaces in Gearman functions names [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/205564 [11:29:22] (03PS5) 10Hashar: Stop all threads on SIGUSR1 [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/225410 [11:34:13] (03PS1) 10Muehlenhoff: Add a role to run a debdeploy master [puppet] - 10https://gerrit.wikimedia.org/r/227682 [11:34:15] (03PS1) 10Muehlenhoff: Add base role for debdeploy clients [puppet] - 10https://gerrit.wikimedia.org/r/227683 [11:40:06] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491295 (10KartikMistry) Thanks @Joe I'm in process of updating Apertium in Debian, so that we can use fresh packages from backpo... [11:45:25] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1491299 (10Chmarkine) Before wikimedia.org is ready to preload, how about emailing agl@chromium.org to request preloading some high traffic and sensitive subdomains of wikimedia.org, like... [12:00:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [12:10:41] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [12:16:18] (03CR) 10ArielGlenn: [C: 031] Use fixed ports for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/226717 (https://phabricator.wikimedia.org/T105040) (owner: 10Muehlenhoff) [12:22:32] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1491333 (10MoritzMuehlenhoff) 3NEW [12:23:28] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1491341 (10BBlack) Technically, I think we can do that without a custom exception. 
We'd need to do a few things on our end regardless: 1. Actually fix those cases (e.g. right now, www.com... [12:23:30] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1491342 (10MoritzMuehlenhoff) [12:38:11] (03CR) 10Hashar: [C: 031] "On beta cluster, the instances relay their syslog to:" [puppet] - 10https://gerrit.wikimedia.org/r/226084 (owner: 10Muehlenhoff) [12:41:06] (03PS2) 10Muehlenhoff: Use fixed ports for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/226717 (https://phabricator.wikimedia.org/T105040) [12:41:15] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use fixed ports for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/226717 (https://phabricator.wikimedia.org/T105040) (owner: 10Muehlenhoff) [12:55:23] (03PS1) 10Merlijn van Deen: toollabs: uwsgi-plain: remove python specifics [puppet] - 10https://gerrit.wikimedia.org/r/227690 [12:55:39] (03PS2) 10Merlijn van Deen: toollabs: uwsgi-plain: remove python specifics [puppet] - 10https://gerrit.wikimedia.org/r/227690 (https://phabricator.wikimedia.org/T104374) [13:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150729T1300). Please do the needful. [13:03:35] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1491403 (10hashar) git clone git://anonscm.debian.org/openstack/python-os-client-config.git dch --bpo Modify chang... 
[13:05:48] andrewbogott: good morning :-} labnodepool1001.eqiad.wmnet can be re imaged :D [13:06:04] I think I figured out all the .deb packages I needed [13:08:15] 6operations, 7Database: Reduce memory commitment on database hosts with many objects, specially s3 and labs - https://phabricator.wikimedia.org/T107282#1491423 (10jcrespo) 3NEW [13:08:41] 6operations, 7Database: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1491432 (10jcrespo) [13:11:22] 6operations, 7Database: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1491437 (10jcrespo) This is related to T107070, but a short term change with existing hardware. [13:12:46] (03PS1) 10BBlack: enable ipsec on cp3031,40,41 [puppet] - 10https://gerrit.wikimedia.org/r/227692 [13:12:48] (03PS1) 10BBlack: enable ipsec on all remaining esams text [puppet] - 10https://gerrit.wikimedia.org/r/227693 [13:14:02] (03PS2) 10BBlack: enable ipsec on cp3031,40,41 [puppet] - 10https://gerrit.wikimedia.org/r/227692 (https://phabricator.wikimedia.org/T92604) [13:14:04] (03PS2) 10BBlack: enable ipsec on all remaining esams text [puppet] - 10https://gerrit.wikimedia.org/r/227693 (https://phabricator.wikimedia.org/T92604) [13:15:00] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1491450 (10BBlack) CPU impact on cp3030 seems to be minimal. You can see it if you squint at the graph, but it's not significant in any decision-making sort of way. Moving forward today w... 
[13:16:50] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1491457 (10BBlack) However, I forgot to test another scenario: should do a cache wipe (backend + frontend) on a depooled cp3030 and then repool it, to see the ipsec spike from cache reload.... [13:19:38] !log depooling cp3030 (all layers) [13:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:09] (03PS2) 10Muehlenhoff: Add ferm rules for syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/226084 [13:23:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/226084 (owner: 10Muehlenhoff) [13:24:43] 6operations, 7Database: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1491470 (10jcrespo) Buffer pool sizes: ``` analytics.my.cnf.erb:innodb_buffer_pool_size = 4G beta.my.cnf.erb:innodb_buffer_pool_size =... [13:27:11] !log repooling cp3030 with wiped caches [13:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:55] 6operations, 10OCG-General-or-Unknown: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#1491476 (10fgiunchedi) aannd it jumped again and the warnings are back, @cscott are these actionable or we should be alerting or something else like failed jobs? [13:32:56] (03PS1) 10Muehlenhoff: Enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/227697 [13:38:06] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia and trusty-wikimedia - https://phabricator.wikimedia.org/T106499#1491499 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi {{done}} ``` root@carbon:~# reprepro -C main --ignore=wrongd... 
[13:41:44] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1491505 (10BBlack) Cache-wipe test didn't induce any notable spike, probably because the order-of-magnitude (or more) traffic reduction we see from fe->be in the text-cache case makes text-... [13:42:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:43:10] ^ fixed [13:43:32] bblack: nitpick: is it IPSec or IPsec? :) [13:44:12] we should just start calling it IpSeC, it looks way cooler :P [13:44:23] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:44:38] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPSec: roll-out plan - https://phabricator.wikimedia.org/T92604#1491529 (10BBlack) a:5Gage>3BBlack [13:44:39] that does look cooler :p [13:44:55] (03CR) 10GoldenRing: [C: 031] "This looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/227690 (https://phabricator.wikimedia.org/T104374) (owner: 10Merlijn van Deen) [13:46:23] (03PS3) 10BBlack: enable ipsec on cp3031,40,41 [puppet] - 10https://gerrit.wikimedia.org/r/227692 (https://phabricator.wikimedia.org/T92604) [13:48:59] (03CR) 10BBlack: [C: 032] enable ipsec on cp3031,40,41 [puppet] - 10https://gerrit.wikimedia.org/r/227692 (https://phabricator.wikimedia.org/T92604) (owner: 10BBlack) [13:51:14] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1491540 (10EWilfong_WMF) [13:52:12] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1485577 (10EWilfong_WMF) @CCogdill_WMF - I've updated the task description with the DNS additions we need for the Major Gifts' event tool. 
[13:53:23] (03PS4) 10coren: Add cleanup-snapshots script [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) [14:00:45] (03PS1) 10Aude: Enable usage tracking on ptwiki + azbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227704 [14:00:53] (03PS1) 10John F. Lewis: add event{donations} CNAMEs for Major Gift [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) [14:01:03] (03CR) 10jenkins-bot: [V: 04-1] add event{donations} CNAMEs for Major Gift [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) (owner: 10John F. Lewis) [14:01:28] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1491565 (10JohnLewis) The above patch should add the needed CNAMEs. [14:02:27] bah dots [14:03:00] (03PS2) 10John F. Lewis: add event{donations} CNAMEs for Major Gift [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) [14:03:34] Yeah was about to give you a review [14:03:38] But you got it :) [14:04:20] JohnFLewis: Was typing azure into the dns entries as painful as it looks? 
[14:04:26] I twitched involuntarily :p [14:05:28] it was painful submitting it for review, especially the aws one :p [14:05:54] (03PS1) 10coren: Add some package requirements for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/227710 (https://phabricator.wikimedia.org/T102478) [14:06:16] aws urls always look like someone is sat there smashing their hands on a keyboard and going 'yeah, that's a good hostname' [14:09:48] 6operations, 7discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285#1491591 (10Joe) 3NEW [14:11:43] !log aude Synchronized php-1.26wmf16/extensions/Wikidata: add usage tracking job (duration: 00m 24s) [14:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:06] 6operations, 7discovery-system: implement write locking in conftool - https://phabricator.wikimedia.org/T107286#1491599 (10Joe) 3NEW [14:13:08] !log aude Synchronized php-1.26wmf15/extensions/Wikidata: add usage tracking job (duration: 00m 20s) [14:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:07] !log aude Synchronized php-1.26wmf15/extensions/Wikidata: rv add usage tracking job (duration: 00m 20s) [14:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:19] (03PS1) 10Muehlenhoff: Add ferm rules for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/227711 (https://phabricator.wikimedia.org/T104991) [14:16:38] (03PS1) 10Muehlenhoff: Enable base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) [14:17:23] (03CR) 10jenkins-bot: [V: 04-1] Enable base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [14:19:58] (03PS1) 10Muehlenhoff: Enable base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/227713 
(https://phabricator.wikimedia.org/T104991) [14:20:52] (03CR) 10jenkins-bot: [V: 04-1] Enable base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/227713 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [14:21:53] (03PS7) 10BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) [14:27:37] (03CR) 10John F. Lewis: [C: 04-1] "Missing commas after port rules." [puppet] - 10https://gerrit.wikimedia.org/r/227711 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [14:27:54] (03CR) 10Aude: [C: 032] Enable usage tracking on ptwiki + azbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227704 (owner: 10Aude) [14:28:00] (03Merged) 10jenkins-bot: Enable usage tracking on ptwiki + azbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227704 (owner: 10Aude) [14:28:45] !log aude Synchronized usagetracking.dblist: Enable usage tracking on ptwiki and azbwiki (duration: 00m 12s) [14:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:37] (03PS2) 10Muehlenhoff: Add ferm rules for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/227711 (https://phabricator.wikimedia.org/T104991) [14:32:34] (03CR) 10John F. 
Lewis: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/227711 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [14:36:56] (03PS8) 10BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) [14:38:22] (03PS2) 10Muehlenhoff: Enable base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) [14:38:45] (03PS2) 10Muehlenhoff: Enable base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/227713 (https://phabricator.wikimedia.org/T104991) [14:39:16] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1491649 (10Joe) 3NEW [14:39:43] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1491662 (10Joe) [14:39:46] 6operations, 6Services, 3Mobile-Content-Service, 7service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1491661 (10Joe) [14:43:10] (03CR) 10Filippo Giunchedi: [C: 031] Ferm rules for Logstash log ingestion [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [14:44:44] (03CR) 10BBlack: "PS8 fixed a minor VCL defect (the default action for vcl_miss is "fetch", not "miss", to bypass bits-unrelated code there). This is teste" [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [14:44:49] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 3LE-CX6-Sprint 1: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491675 (10Arrbee) [14:45:51] hashar_: sorry, got a late start today :( are you still working for a bit? 
Want to rebuild that server now? [14:47:08] RECOVERY - Cassanda CQL query interface on restbase1007 is OK: TCP OK - 0.005 second response time on port 9042 [14:47:34] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: stricter permissions on cassandra data dir - https://phabricator.wikimedia.org/T106133#1491676 (10fgiunchedi) 5Open>3Resolved fixed [14:54:48] (03PS1) 10Alex Monk: Revert "Revert "Add special wikipedias to wikipedia.dblist"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227718 [14:55:15] (03PS2) 10Alex Monk: Revert "Revert "Add special wikipedias to wikipedia.dblist"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227718 [14:55:28] who's doing swat today? [14:55:29] (03PS3) 10Alex Monk: Revert "Revert "Add special wikipedias to wikipedia.dblist"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227718 [14:55:37] (03PS3) 10BBlack: enable ipsec on all remaining esams text [puppet] - 10https://gerrit.wikimedia.org/r/227693 (https://phabricator.wikimedia.org/T92604) [14:56:17] (03PS1) 10Muehlenhoff: add ferm rules for udp2log [puppet] - 10https://gerrit.wikimedia.org/r/227719 [14:56:19] (03PS1) 10Muehlenhoff: Enable base::firewall for fluorine [puppet] - 10https://gerrit.wikimedia.org/r/227720 [14:56:44] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1491695 (10fgiunchedi) bootstrapping restbase1007 was successful! ``` 15:47 -icinga-wm:#wikimedia-operations- RECOVERY - Cassanda CQL q... [14:57:58] (03CR) 10BBlack: [C: 032] enable ipsec on all remaining esams text [puppet] - 10https://gerrit.wikimedia.org/r/227693 (https://phabricator.wikimedia.org/T92604) (owner: 10BBlack) [14:59:13] twentyafterfour: around? [15:00:00] Krenair: I wasn't sure, specifically, with labs db replication. 
[15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150729T1500). Please do the needful. [15:00:11] And since arbcom wikis are private, we erred on side of caution and reverted. [15:01:05] I'm available to SWAT if anyone needs it, although nothing is posted. [15:01:14] thcipriani: ok [15:01:14] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1491703 (10fgiunchedi) >>! In T102557#1397382, @Cmjohnson wrote: > I am going to RMA the 2 Samsung disks from restbase1008 @cmjohnson the new ssd seem to be working fine from ou... [15:01:40] just that 1) i am going to deploy https://gerrit.wikimedia.org/r/#/c/227716/ (proper revert of our extension update) [15:01:43] (03PS4) 10Muehlenhoff: Ferm rules for Logstash log ingestion [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) [15:01:44] soon as jenkins allows [15:01:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Ferm rules for Logstash log ingestion [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [15:02:05] 2) Wikidata submodule was not correct on wmf16 today (it was on something old) [15:02:13] i don't think http://git.wikimedia.org/commitdiff/mediawiki%2Fcore.git/2b9032e61d833a374d782bf5c01137c199963227 got properly deployed [15:02:22] and other submodules might also be on wrong versions [15:02:23] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1491704 (10GWicke) /me high-fives @fgiunchedi and @eevans! It took a bit longer than planned to add that node, but it's all good in the... 
[15:03:11] i'm busy with wikidata stuff and can't look at the submodules right now [15:04:57] hmm, that's no good, I wonder if branch was cut from outdated config.json? I'll poke. [15:05:11] i think twentyafterfour tried to do something new / different [15:06:34] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491719 (10Amire80) [15:06:49] well it may just be that he didn't merge in the move to wmf/1.26wmf16 for wikidata. Only recent change in config.json. [15:07:05] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPSec: roll-out plan - https://phabricator.wikimedia.org/T92604#1491720 (10BBlack) Update: all of esams text cluster is now using ipsec for backhaul to tier1. [15:07:24] thcipriani: the submodule patch was not on tin [15:07:28] until i rebased [15:08:09] http://git.wikimedia.org/commit/mediawiki%2Fcore.git/2b9032e61d833a374d782bf5c01137c199963227 was not there [15:13:46] are you deploying aude? 
[15:14:13] (03PS1) 10Muehlenhoff: Add ferm rules for new Logstash ingestion module logstash::input::udp [puppet] - 10https://gerrit.wikimedia.org/r/227723 (https://phabricator.wikimedia.org/T104964) [15:15:33] (03PS1) 10Giuseppe Lavagetto: Add mobileapps LVS IP [dns] - 10https://gerrit.wikimedia.org/r/227724 (https://phabricator.wikimedia.org/T92627) [15:16:04] (03CR) 10Alex Monk: [C: 032] Revert "Revert "Add special wikipedias to wikipedia.dblist"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227718 (owner: 10Alex Monk) [15:16:29] (03Merged) 10jenkins-bot: Revert "Revert "Add special wikipedias to wikipedia.dblist"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227718 (owner: 10Alex Monk) [15:17:01] revert revert revert your boat [15:17:26] Krenair: i am [15:17:55] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227718/3 (duration: 00m 12s) [15:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:08] !log krenair Synchronized wikipedia.dblist: https://gerrit.wikimedia.org/r/#/c/227718/3 (duration: 00m 12s) [15:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:08] (03PS1) 10Giuseppe Lavagetto: Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/227725 [15:20:10] (03PS1) 10Giuseppe Lavagetto: Assign mobileapps service to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/227726 [15:20:12] (03PS1) 10Giuseppe Lavagetto: Setup LVS for mobileapps service on sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/227727 [15:20:19] doing [15:20:35] !log aude Synchronized php-1.26wmf15/extensions/Wikidata: rv usage tracking change (duration: 00m 20s) [15:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:52] Ah. 
I forgot the merge step :) [15:20:54] done for now [15:21:31] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227718/3 (duration: 00m 12s) [15:21:36] <_joe_> mobrovac: ^^ all generated with alex's script [15:21:43] <_joe_> in like 5 minutes [15:21:43] !log krenair Synchronized wikipedia.dblist: https://gerrit.wikimedia.org/r/#/c/227718/3 (duration: 00m 12s) [15:22:05] _joe_: cool! [15:22:11] there, that fixed it [15:22:20] <_joe_> mobrovac: yeah it is [15:22:37] _joe_: (see my question in #services) [15:22:51] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1490489 (10RobH) p:5Triage>3Normal [15:23:34] andrewbogott: going to leave soon sorry :/ [15:23:45] ah, there you are :) [15:23:50] andrewbogott: but I guess you can rebuild labnodepool1001.eqiad.wmnet [15:23:56] can catch up later [15:24:02] My fault for oversleeping. You need me to merge that patch first, right? [15:24:07] https://gerrit.wikimedia.org/r/#/c/227663/2/modules/nodepool/manifests/init.pp [15:24:23] yeah potentially :-} [15:24:26] (03CR) 10Alex Monk: [C: 032] Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) (owner: 10coren) [15:24:44] 6operations, 6Services: Find spares for SCA services - https://phabricator.wikimedia.org/T107137#1491765 (10mobrovac) [15:24:49] turns out I will need a few packages from backports with the bump of nodepool to v1.0.0 [15:24:51] (03Merged) 10jenkins-bot: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) (owner: 10coren) [15:25:31] andrewbogott: and https://gerrit.wikimedia.org/r/#/c/227461/ get rid of disk image since that requires root [15:25:41] ok [15:25:56] so that earlier patch with jessie-backports; that’s our custom repo right? 
[15:25:59] I am going to join greg-g for our 1/1 then commute back home :/ [15:26:06] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1491778 (10mobrovac) [15:26:08] !log krenair Synchronized database lists: https://gerrit.wikimedia.org/r/#/c/194549/ (duration: 00m 11s) [15:26:09] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1491779 (10mobrovac) [15:26:12] 6operations, 6Services: Find spares for SCA services - https://phabricator.wikimedia.org/T107137#1491773 (10mobrovac) 5Open>3Resolved a:3mobrovac We are assigning `wmf4541` and `wmf4543` as `scb100[12]`, see continuation of this work in {T107287} [15:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:30] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/194549/ (duration: 00m 13s) [15:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:39] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1491649 (10mobrovac) [15:26:42] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1116669 (10mobrovac) [15:26:53] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1491649 (10mobrovac) [15:26:56] 6operations, 6Services, 3Mobile-Content-Service, 7service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1491782 (10mobrovac) [15:27:06] !log krenair Synchronized tests/dblistTest.php: 
https://gerrit.wikimedia.org/r/#/c/194549/ (duration: 00m 13s) [15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:17] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1491649 (10mobrovac) [15:27:18] andrewbogott: labnodepool has both jessie and jessie-backports :) [15:27:20] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1206310 (10mobrovac) [15:27:34] (03PS1) 10RobH: add krinkle ot statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/227728 [15:27:39] ok [15:27:54] hashar: I’ll rebuild that box and you can admire its working (or broken) state when you return :) [15:28:01] You’re sure it’s safe for me to wipe its current contents? [15:28:06] andrewbogott: maybe the /etc/apt/sources.list is wrong :/ [15:28:16] andrewbogott: yeah you can wipe it [15:28:17] hmm... might need to revert that [15:28:32] if something is lost, I will rebuild it but afaik all credentials are in private git repo [15:29:08] (03PS1) 10BBlack: bugfix to HTTPS redirect regex [puppet] - 10https://gerrit.wikimedia.org/r/227729 [15:29:10] (03PS1) 10BBlack: disable ipsec config for cp3011 (down for hw issue) [puppet] - 10https://gerrit.wikimedia.org/r/227730 [15:29:29] (03CR) 10BBlack: [C: 032 V: 032] bugfix to HTTPS redirect regex [puppet] - 10https://gerrit.wikimedia.org/r/227729 (owner: 10BBlack) [15:29:41] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1491788 (10RobH) Patchset https://gerrit.wikimedia.org/r/#/c/227728/ has been submitted into gerrit for adding @Krinkle to the statistics-privatedata-users group. This has to wait until... 
[15:29:44] (03CR) 10BBlack: [C: 032 V: 032] disable ipsec config for cp3011 (down for hw issue) [puppet] - 10https://gerrit.wikimedia.org/r/227730 (owner: 10BBlack) [15:30:24] !log krenair Synchronized wikisource.dblist: https://gerrit.wikimedia.org/r/#/c/194549/ (duration: 00m 12s) [15:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:10] it doesn't make sense [15:31:28] wmgUseProofreadPage is set to true for wikisources [15:31:43] and sourceswiki was added to wikisource.dblist, and I verified that sync'd properly [15:31:56] and yet it doesn't load the proofreadpage extension [15:32:06] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1491792 (10fgiunchedi) @cmjohnson also please swap the intel SSD on restbase1008 with new ones we got last, thanks! [15:33:00] Krenair: The sourceswiki config was full of special casing and fail; there's probably some still hanging around somewhere. [15:33:44] !log krenair Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 12s) [15:33:46] andrewbogott: the recent nodepool rely on some more advanced openstack packages. 
Luckily we have them in backports \O/ [15:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:47] okay, I'm going to have to revert [15:34:59] (03PS1) 10Alex Monk: Revert "Move sourceswiki special.dblist->wikisource.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227734 [15:35:25] (03PS2) 10Alex Monk: Revert "Move sourceswiki special.dblist->wikisource.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227734 [15:35:32] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1491799 (10Ottomata) [15:35:33] (03CR) 10Alex Monk: [C: 032] Revert "Move sourceswiki special.dblist->wikisource.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227734 (owner: 10Alex Monk) [15:35:39] (03Merged) 10jenkins-bot: Revert "Move sourceswiki special.dblist->wikisource.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227734 (owner: 10Alex Monk) [15:36:02] !log krenair Synchronized database lists: (no message) (duration: 00m 12s) [15:36:03] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1490489 (10Ottomata) @Robh, `statistics-privatedata-users` was the wrong group. Please use `analytics-privatedata-users` [15:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:10] Krenair: do you have time to deply a couple small cherry-picks for an hhvm imagescaler bug? https://gerrit.wikimedia.org/r/#/q/Ia610a43789e6ff14cfc0964f285bbec39c890152,n,z [15:36:18] !log krenair Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 12s) [15:36:42] !log krenair Synchronized tests/dblistTest.php: (no message) (duration: 00m 10s) [15:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:10] Krenair: Blah, you didn't even respond to my ping... 
[15:38:16] ostriches, ? [15:38:34] Krenair: I wasn't sure, specifically, with labs db replication. [15:38:35] And since arbcom wikis are private, we erred on side of caution and reverted. [15:38:48] Also, Krinkle_ brought up meta_p [15:40:07] (03PS2) 10RobH: add krinkle ot statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/227728 [15:40:48] ostriches, I'm not sure how labs db replication works [15:41:08] I wasn't sure exactly either w.r.t. sanitarium, so I erred and reverted. [15:41:11] I know this will set family=wikipedia on those sites in meta_p rather than family=special [15:41:44] fwiw, it did keep them in the special section in SiteMatrix [15:41:52] What's that about labs replication? [15:41:56] what is the issue with labsdbs? [15:42:21] We added a bunch of sites missing from wikipedia.dblist into wikipedia.dblist [15:42:25] some of them are also in private.dblist [15:42:28] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1491817 (10RobH) a:3mark This is being discussed in IRC, as it needs Mark to also sign off on it. I'll note that this will use up two really hig... [15:42:30] I hope you are handling private.dblist properly [15:43:03] I just wanted to be cautious :) [15:44:39] Krenair: Two lines of defence, IIRC. Sanitarium shouldn't be replicating those at all, but the view creator will also refuse to create views to those dbs even if they end up replicated. [15:44:58] I'm not sure about that second part [15:45:21] not sure abut the first part either [15:45:24] View creation loop as "next if defined $db->{'private'};" specifically for that [15:45:44] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1491836 (10RobH) Thanks for catching that, https://gerrit.wikimedia.org/r/#/c/227728/ has been updated. 
[15:45:49] (Populated from private.dblist) [15:46:11] in maintain-replicas? [15:46:18] Krenair: yep [15:46:31] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1491837 (10Joe) I can add I tried to avoid spares with SSDs, and the ones out of warranty. I expect 48 GB of RAM to be a minimum requirements for a... [15:46:38] Yeah that's just for meta_p, I don't think it controls views [15:47:34] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1491838 (10KartikMistry) I had discussion with upstream and restarting... [15:47:52] Coren, ^ [15:48:18] Krenair: Checking [15:48:23] should probably be doing that before sql("CREATE DATABASE ${dbk}_p;") [15:48:52] where does sanitarium check for these? [15:49:18] there are no checks, there is no security at all [15:49:18] Krenair: ... you are clearly correct. D'oh. [15:51:24] robh: might also want to update the commit title+desc for the krinkle patch? [15:51:31] Krenair: Inspection shows that the %_p databases were created for private wikis, with no views in them. [15:51:36] (03PS1) 10Alex Monk: Don't try to create views for private/deleted wikis [software] - 10https://gerrit.wikimedia.org/r/227735 [15:52:06] (03CR) 10Chad: [C: 031] Don't try to create views for private/deleted wikis [software] - 10https://gerrit.wikimedia.org/r/227735 (owner: 10Alex Monk) [15:52:20] yeah, still probably shouldn't even get that far [15:52:29] Krenair: Because the underlying database isn't there. E.g. 'fdcwiki' isn't there even though there is an (empty) 'fdcwiki_p' [15:52:36] Yep. Making a patch now. 
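A minimal sketch of the ordering discussed above, where the private/deleted check happens before any `CREATE DATABASE ${dbk}_p;` is issued, so that no empty `_p` database is left behind for a private wiki. This is not the actual maintain-replicas code (which is Perl); the function name `plan_public_databases`, its arguments, and the sample list of wikis are assumptions made purely for illustration.

```python
def plan_public_databases(all_dbs, private_dbs, deleted_dbs=()):
    """Return CREATE DATABASE statements only for wikis that may be exposed.

    Private and deleted wikis are filtered out *before* any statement is
    generated, so no empty `<db>_p` database is ever planned for them.
    """
    skip = set(private_dbs) | set(deleted_dbs)
    statements = []
    for db in all_dbs:
        if db in skip:
            # Same intent as the `next if defined $db->{'private'};` check,
            # but applied before the database is created, not only the views.
            continue
        statements.append(f"CREATE DATABASE {db}_p;")
    return statements


if __name__ == "__main__":
    # Hypothetical input: fdcwiki is treated as private, as in the log above.
    wikis = ["enwiki", "fdcwiki", "sourceswiki"]
    for stmt in plan_public_databases(wikis, private_dbs=["fdcwiki"]):
        print(stmt)
```

Filtering up front keeps the "should this wiki be visible at all" decision in one place, instead of relying on a later step to refuse to create views.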
[15:52:41] (03PS3) 10Andrew Bogott: nodepool: use OpenStack modules from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/227663 (https://phabricator.wikimedia.org/T104971) (owner: 10Hashar) [15:52:47] I already uploaded a patch [15:53:15] (03PS2) 10Alex Monk: maintain-replicas: Don't try to create _p DBs for private/deleted wikis [software] - 10https://gerrit.wikimedia.org/r/227735 [15:53:22] so, there are no private databases on labs [15:53:26] no [15:53:30] why, I don't know [15:53:48] (03CR) 10Andrew Bogott: [C: 032] nodepool: use OpenStack modules from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/227663 (https://phabricator.wikimedia.org/T104971) (owner: 10Hashar) [15:54:15] (03PS3) 10RobH: add krinkle ot analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/227728 [15:54:36] (03CR) 10coren: [C: 032] "I was /just/ about to send an identical patch. :-)" [software] - 10https://gerrit.wikimedia.org/r/227735 (owner: 10Alex Monk) [15:54:46] (03CR) 10coren: [V: 032] "I was /just/ about to send an identical patch. :-)" [software] - 10https://gerrit.wikimedia.org/r/227735 (owner: 10Alex Monk) [15:55:03] (03PS3) 10Andrew Bogott: nodepool: stop using diskimage [puppet] - 10https://gerrit.wikimedia.org/r/227461 (https://phabricator.wikimedia.org/T102281) (owner: 10Hashar) [15:55:06] 10Ops-Access-Requests, 6operations: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1491869 (10Milimetric) oops. @Krinkle, it's actually analytics-privatedata-users you need to be a part of, I'll edit the description [15:55:34] heh, note to self: refresh page [15:56:43] (03CR) 10Andrew Bogott: [C: 032] nodepool: stop using diskimage [puppet] - 10https://gerrit.wikimedia.org/r/227461 (https://phabricator.wikimedia.org/T102281) (owner: 10Hashar) [15:56:57] i lied, there the barrier exists on sanitarium, too [15:58:22] jynus, where is the check for that? 
[15:58:43] there is a replication filter [15:59:07] but I do not know how it is maintained and I highly suspect it is opt in and manually mantained [15:59:55] so what file am I looking at? [15:59:59] oh [16:00:03] I am worng again [16:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150729T1600). Please do the needful. [16:00:14] I should have more faith [16:00:16] (03PS1) 10Alex Monk: Revert "Revert "Move sourceswiki special.dblist->wikisource.dblist"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227738 [16:00:29] it is automatically mantained, let me give you the path, krenair [16:00:54] operations/puppet/templates/mariadb/sanitarium.my.cnf.erb [16:00:59] jynus: There are a lot of leftover uglies in the replication setup, but it's not 100% bad either. :-) [16:00:59] !log installed qemu security updates on labvirt* [16:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:18] andrewbogott: I reenabled puppet [16:01:19] I am sorry, I tend to be untrusstful [16:01:48] Coren, I wasn't talkig about labs [16:01:50] jynus: Not a bad quality, even though it makes you sound a little cynical at times. :-) [16:01:53] but sanitarium [16:02:01] it has given problems in the past [16:02:30] oh wow [16:02:32] andrewbogott: packages are pinned and disk image is gone :-} [16:02:34] a copy of private.dblist in puppet? [16:02:39] cool [16:02:46] better hope that always matches the real private.dblist... 
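One way to avoid merely hoping that the puppet copy matches the real private.dblist would be a small drift check, sketched below. It assumes a dblist is plain text with one database name per line and that `#` starts a comment; the function names and the sample names are hypothetical, and nothing here reflects how the sanitarium template is actually generated.

```python
def parse_dblist(text):
    """Parse dblist contents: one database name per line, '#' starts a comment."""
    names = set()
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            names.add(name)
    return names


def dblist_drift(canonical_text, copy_text):
    """Report names present in only one of the two lists."""
    canonical = parse_dblist(canonical_text)
    copy = parse_dblist(copy_text)
    return {
        "missing_from_copy": sorted(canonical - copy),
        "extra_in_copy": sorted(copy - canonical),
    }


if __name__ == "__main__":
    # Hypothetical contents; real lists would be read from the two repositories.
    canonical = "fdcwiki\nexampleprivatewiki\n"
    puppet_copy = "fdcwiki\n"
    print(dblist_drift(canonical, puppet_copy))
```

A check like this could run in CI for either repository and fail when the two lists diverge, which matters most in the "missing_from_copy" direction, since that is the case where a private wiki would not be filtered.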
[16:02:51] I’m going to give this module one last look and then I’ll re-image [16:03:12] andrewbogott: great, I will catch up tomorrow morning [16:03:50] At least it matches at the moment [16:03:56] so, unmanteined [16:04:07] (03PS1) 10BBlack: Revert "No need for wgSecureLogin on our wikis, HTTPS is forced everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227740 [16:04:10] and the cynical me was partially right [16:04:12] (03PS2) 10BBlack: Revert "No need for wgSecureLogin on our wikis, HTTPS is forced everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227740 [16:04:27] unmantained == manually unmantained for me [16:04:29] At least there is documentation for it: https://wikitech.wikimedia.org/wiki/Add_a_wiki#IMPORTANT:_For_Private_Wikis [16:04:31] *mant [16:05:29] (03PS3) 10BBlack: Revert "No need for wgSecureLogin on our wikis, HTTPS is forced everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227740 (https://phabricator.wikimedia.org/T103021) [16:05:33] and this is dba staff, not labs, Coren, it is a production host. I am more cynical with myself than with anyone else [16:06:08] jynus: Seriously, I wasn't complaining - being paranoid about such things is a good thing. :-) [16:06:14] not working, bblack? [16:06:42] so, best course of action? checkout the file ? [16:07:06] Krenair: On the original topic, I can't find anything in mediawiki-config that could explain why sourceswiki doesn't get the proofreadpage extension. :-( [16:07:26] yeah, I don't understand that either [16:07:42] (03CR) 10Alex Monk: [C: 04-1] "TODO: Figure out why this breaks everything." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227738 (owner: 10Alex Monk) [16:08:09] Krenair: Is there a way to find largest images on commons - pixelwise ?or size wise ? 
[16:08:13] Krenair: complicated, and not for this channel [16:08:18] ok [16:08:53] matanya, well, there is image.img_size [16:09:06] there's also img_width and img_height there [16:09:06] are there ongoing scaps and such, or are we clear to sync a wmf-config change? [16:09:14] matanya, so yeah, probably [16:09:20] bblack: not deploying yet/now [16:09:27] ori might be doing something [16:09:39] but I think to core and not to mediawiki-config, so you should be fine [16:09:50] Krenair: can i query that? get some report or something ? [16:09:55] (03CR) 10BBlack: [C: 032] Revert "No need for wgSecureLogin on our wikis, HTTPS is forced everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227740 (https://phabricator.wikimedia.org/T103021) (owner: 10BBlack) [16:09:57] matanya, sure, via labs replicas [16:10:42] ah, directly in the DB, ok. A user asked me for a freindly special page or something. [16:11:08] matanya, DB with quarry? [16:11:15] !log ori Synchronized php-1.26wmf16/thumb.php: 2c9518ed78: Add Content-Length header to thumb.php redirects (duration: 00m 12s) [16:11:17] that was my next shot [16:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:31] !log ori Synchronized php-1.26wmf15/thumb.php: 2c9518ed78: Add Content-Length header to thumb.php redirects (duration: 00m 12s) [16:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:57] matanya, I do not know the api, you could check that, too [16:11:59] (03PS15) 10Dduvall: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) [16:12:32] thanks jynus I don't think that user is capable of using sql or api, but thanks! 
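A rough sketch of the kind of query being suggested, using the `img_size`, `img_width` and `img_height` columns named above. To stay self-contained it runs against a throwaway in-memory SQLite table rather than a labs replica (which would be MariaDB, reached for example through Quarry); the table contents, the helper names and the `limit` default are made up for illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the `image` table with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image ("
        "img_name TEXT PRIMARY KEY, img_size INTEGER, "
        "img_width INTEGER, img_height INTEGER)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?, ?)",
        [
            ("Small.png", 1200, 100, 50),
            ("Big.jpg", 50000000, 10000, 8000),
            ("Wide.svg", 2000, 20000, 100),
        ],
    )
    return conn


def largest_images(conn, limit=5):
    """Return (img_name, pixels, img_size) rows, largest pixel area first."""
    return conn.execute(
        "SELECT img_name, img_width * img_height AS pixels, img_size "
        "FROM image ORDER BY pixels DESC LIMIT ?",
        (limit,),
    ).fetchall()


if __name__ == "__main__":
    for row in largest_images(make_demo_db(), limit=2):
        print(row)
```

Sorting by `img_size` instead of the computed pixel count would rank by file size in bytes. Either ordering is likely to be slow on a very large table, because neither is backed by an index in this sketch.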
[16:12:32] !log ori Synchronized wmf-config: Revert "No need for wgSecureLogin on our wikis, HTTPS is forced everywhere" (duration: 00m 13s) [16:12:34] ^ bblack [16:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:47] ori: thanks [16:13:39] !log depooled Precise image scalers (mw1159 / mw1160)to see if 2c9518ed78 helped. [16:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:19] <_joe_> ori: I was about to do the same :P [16:14:30] (03CR) 10Dduvall: "Rebased and removed the hit-for-pass bits." [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [16:14:31] !log re-imaging labnodepool1001 [16:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:59] 6operations, 10Citoid, 6Services: Package and test Zotero for Jessie - https://phabricator.wikimedia.org/T107302#1491929 (10mobrovac) 3NEW [16:15:04] _joe_: 503 rate is kind of high, i bet the mediawiki exception log is being ignored again :( [16:15:10] <_joe_> ori: my test image works now [16:15:22] _joe_: \o/ [16:15:44] _joe_: wow, 503s plummeted [16:15:45] <_joe_> let's not chant for victory, but I guess the 503 rate just plunged [16:15:49] haha [16:17:21] <_joe_> good. 
[16:17:29] * ori chants [16:17:48] PROBLEM - Host labnodepool1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:37] RECOVERY - Host labnodepool1001 is UPING OK - Packet loss = 0%, RTA = 1.85 ms [16:20:46] there's a correlation in 4xx rates around there too [16:21:01] might be normal for generating thumbs for previously broken requests, though [16:21:44] yes, it's great, the whole setup was basically designed to be impossible to debug [16:22:09] :) [16:22:10] thumb generation is also retried up to N times by repeatedly issuing redirects [16:22:25] matanya: Naive pass could be 'select * from image order by img_width*img_height desc limit 5;' - for instance - as a quarry query. But that'll get svg with large values also I think and won't be fast (because no index on w*h clearly) [16:22:34] (03PS18) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [16:22:36] (03PS11) 10BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) [16:22:54] thanks Coren i will give it a shot and fine-tune [16:23:05] basically every HTTP status code for the image scaler stack has two meanings: "business as usual" and "serious error" [16:23:32] <_joe_> ori: "it depends on the wind direction" [16:23:47] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: Connection refused by host [16:24:05] <_joe_> ori: in other words, https://www.teezily.com/systemsengineer341rb2u#item=2102865&side=front [16:24:06] PROBLEM - salt-minion processes on labnodepool1001 is CRITICAL: Connection refused by host [16:24:16] PROBLEM - configured eth on labnodepool1001 is CRITICAL: Connection refused by host [16:24:26] heh cute [16:24:47] PROBLEM - Check size of conntrack table on labnodepool1001 is CRITICAL: Connection refused by host [16:24:59] <_joe_> in this case, those of questionable knowledge is ourselves :P [16:25:06] PROBLEM - DPKG on labnodepool1001 
is CRITICAL: Connection refused by host [16:25:20] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 6Services: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1491973 (10mobrovac) 3NEW [16:25:26] PROBLEM - RAID on labnodepool1001 is CRITICAL: Connection refused by host [16:25:27] PROBLEM - Disk space on labnodepool1001 is CRITICAL: Connection refused by host [16:25:37] PROBLEM - dhclient process on labnodepool1001 is CRITICAL: Connection refused by host [16:26:06] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 6Services: Test CXServer in Jessie - https://phabricator.wikimedia.org/T107307#1491980 (10mobrovac) 3NEW [16:27:32] (03PS1) 10Chad: beta: Swap text caches to -text04, which is jessie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227743 (https://phabricator.wikimedia.org/T98758) [16:27:45] (03PS1) 10Chad: beta: swap text caches to text04, which is jessie [puppet] - 10https://gerrit.wikimedia.org/r/227744 (https://phabricator.wikimedia.org/T98758) [16:28:55] wheee [16:30:46] (03PS2) 10Ori.livneh: Add ferm rules for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/226506 (https://phabricator.wikimedia.org/T104972) (owner: 10Muehlenhoff) [16:30:59] (03CR) 10Ori.livneh: [C: 032] Add ferm rules for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/226506 (https://phabricator.wikimedia.org/T104972) (owner: 10Muehlenhoff) [16:31:19] (03CR) 10Ori.livneh: [V: 032] Add ferm rules for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/226506 (https://phabricator.wikimedia.org/T104972) (owner: 10Muehlenhoff) [16:32:23] (03CR) 10Dzahn: [C: 04-1] "before going forward this neeeds the meeting mentioned on T107059 , also see general concerns about pointing to 3rd parties described ther" [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) (owner: 10John F. 
Lewis) [16:34:47] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 6Services: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1492032 (10KartikMistry) Thanks. Do we've script that can rebuild current trusty packages for Jessie? [16:36:48] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1492037 (10faidon) Two notes: - Why can't we have one domain for both? Polluting our namespace like that doesn't sound great. - "events" also sounds a bit generic — especial... [16:41:53] 6operations, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1492041 (10jcrespo) From varnish: ``` {"hostname":"cp1068.eqiad.wmnet","sequence":14321379515,"dt":"2015-07-29T16:24:30","time_firstbyte":60.163419... [16:41:56] RECOVERY - RAID on labnodepool1001 is OK Active: 6, Working: 6, Failed: 0, Spare: 0 [16:41:57] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [16:41:58] RECOVERY - dhclient process on labnodepool1001 is OK: PROCS OK: 0 processes with command name dhclient [16:42:47] RECOVERY - configured eth on labnodepool1001 is OK - interfaces up [16:43:28] RECOVERY - Check size of conntrack table on labnodepool1001 is OK nf_conntrack is 0 % full [16:43:37] RECOVERY - DPKG on labnodepool1001 is OK: All packages OK [16:44:36] 6operations: Backport ffmpeg 2.7.3 to Trusty - https://phabricator.wikimedia.org/T107313#1492064 (10ori) 3NEW [16:45:18] (03PS2) 10Dzahn: Enable base::firewall on lithium [puppet] - 10https://gerrit.wikimedia.org/r/227697 (owner: 10Muehlenhoff) [16:48:04] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 6Services: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1492073 (10mobrovac) p:5Triage>3High [16:48:43] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 
6Services: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1491973 (10mobrovac) >>! In T107306#1492032, @KartikMistry wrote: > Do we've script that can rebuild current trusty packages for Jessie? Or it need... [16:48:53] hashar: "Unable to locate package nodepool" [16:49:00] robh: hi, what is the next step regarding my access request for server-side upload? pending mark's comment ? [16:49:02] 6operations: Backport ffmpeg 2.7.3 to Trusty - https://phabricator.wikimedia.org/T107313#1492080 (10ori) [16:49:04] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1492079 (10ori) [16:49:25] andrewbogott: looks like I forgot to get it in apt.wm.o :/ [16:49:47] matanya: yea, it was discussed during the meeting and everyone was pretty much 'we trust matanya but this is messed up that this is how it has to happen' [16:49:47] is it something you built yourself? Or can you give me a download link? [16:49:49] andrewbogott: https://phabricator.wikimedia.org/T104971#1491022 has the links [16:50:05] so i think mark wanted to give it a glance and decide the amount of work to make it work otherwise, not entirely certain [16:50:14] but since he said he wanted to review, i turfed it to him [16:50:20] andrewbogott: either https://people.wikimedia.org/~hashar/debs/nodepool_0.1.0-wmf1/ or scp terbium.eqiad.wmnet:/home/hashar/public_html/debs/nodepool_0.1.0-wmf1/* [16:50:22] hashar: ok! [16:50:46] hashar: that’s built from source that’s in our gerrit? (Sorry, I’m sure we’ve discussed this before.) [16:50:54] andrewbogott: definitely. 
https://gerrit.wikimedia.org/r/#/q/project:operations/debs/nodepool+branch:debian,n,z [16:51:08] andrewbogott: the related changelog is https://gerrit.wikimedia.org/r/#/c/227665/1/debian/changelog,unified [16:51:25] robh: can't agree more [16:51:28] ACKNOWLEDGEMENT - nova-network process on labnet1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network andrew bogott Andrew is still building this box. [16:52:04] andrewbogott: what I do is I review upstream changes, bump the 'upstream' branch, merge it in 'debian' branch and rebuild. [16:52:23] ok, sounds good. I will add it to the repo [16:52:37] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1492107 (10ori) [16:52:38] at least the reimage caught that! [16:56:18] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1492128 (10mobrovac) [16:57:49] andrewbogott: i am off, will catch up tomorrow [16:58:02] so long! [16:59:26] andrewbogott: Can we get a RAM bump on deployment-prep? Trying to replace some nodes but don't have enough wiggle room [16:59:39] ostriches: yep, stay tuned [17:01:00] RECOVERY - puppet last run on labnodepool1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:02:14] ostriches: try now? [17:02:24] uno momento, in a mtg. [17:08:50] andrewbogott: Works, thx! [17:10:11] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1492182 (10Andrew) I re-imaged labnodepool1001 and got a clean puppet run. Nice work! Hashar, you can verify that it's behaving adequately and then Chase or I will yank your root... 
[17:12:32] SELECT * from image WHERE img_name = "The_New_International_Encyclopædia_1st_ed._v._14.djvu"; [17:12:32] WTF [17:12:46] it is normal that the whole .djvu is stored in the img_metadata table? [17:13:22] RECOVERY - salt-minion processes on labnodepool1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:14:53] hoo^^ [17:20:17] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1492209 (10faidon) My take (pardon me for repeating what others have said): - I don't think we should be adding any new non-HTTPS sites at this point. We've taken a very deliberate policy decision... [17:26:29] is anyone deploying now? [17:27:24] need to bump wikidata on wmf16 [17:31:01] bd808: ori https://gerrit.wikimedia.org/r/#/c/227733/ is merged but not deployed yet [17:31:14] i'll stash it and deploy my wikidata change [17:31:53] aude: [10:11] !log ori Synchronized php-1.26wmf16/thumb.php: 2c9518ed78: Add Content-Length header to thumb.php redirects (duration: 00m 12s) [17:31:59] oh [17:32:06] not on 15? [17:32:08] maybe he fetched but didn't rebase? [17:32:16] 15 was right after [17:32:18] ah [17:32:34] 16 [17:32:52] i can do that to be sure [17:33:09] _joe_, ^ [17:33:22] it's not on tin yet [17:36:51] doing [17:37:01] !log aude Synchronized php-1.26wmf16/thumb.php: 2c9518ed78: Add Content-Length header to thumb.php redirects (duration: 00m 13s) [17:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:46] !log aude Synchronized php-1.26wmf16/extensions/Wikidata: fix focus when entering site links (duration: 00m 22s) [17:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:01] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1492264 (10CCogdill_WMF) @faidon: > Why can't we have one domain for both? 
Polluting our namespace like that doesn't sound great. We need both domains because the event pag... [17:41:41] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1492266 (10CCogdill_WMF) p:5Triage>3High [17:44:10] 7Blocked-on-Operations, 10MediaWiki-Database, 7Schema-change: Change pp_sortkey from float to double - https://phabricator.wikimedia.org/T107323#1492269 (10Mattflaschen) 3NEW [17:46:02] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1492286 (10Pchelolo) a:3Pchelolo [17:46:06] 7Blocked-on-Operations, 10MediaWiki-Database, 7Schema-change: Change pp_sortkey from float to double - https://phabricator.wikimedia.org/T107323#1492288 (10Mattflaschen) [17:49:04] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1492295 (10Krenair) >>! In T107060#1492037, @faidon wrote: > - "events" also sounds a bit generic — especially under our global movement domain (wikimedia.org). For example,... [17:49:22] 7Blocked-on-Operations, 10MediaWiki-Database, 5Patch-For-Review, 7Schema-change: Change pp_sortkey from float to double - https://phabricator.wikimedia.org/T107323#1492298 (10Mattflaschen) a:3matthiasmullie [17:55:30] (03PS3) 10Yuvipanda: toollabs: uwsgi-plain: remove python specifics [puppet] - 10https://gerrit.wikimedia.org/r/227690 (https://phabricator.wikimedia.org/T104374) (owner: 10Merlijn van Deen) [17:55:41] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: uwsgi-plain: remove python specifics [puppet] - 10https://gerrit.wikimedia.org/r/227690 (https://phabricator.wikimedia.org/T104374) (owner: 10Merlijn van Deen) [18:00:04] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150729T1800). 
[18:03:11] (03CR) 10Dzahn: "i'm not sure about this either, adding (tool)labs reviewers" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [18:06:15] (03PS5) 10Dzahn: Ignore warnings about URLs without modules for private repository [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [18:07:54] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1492345 (10Cmjohnson) [18:07:57] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1492343 (10Cmjohnson) 5Open>3Resolved Swapped the ssds in restbase1008 -- will create a new task to RMA the other ssds [18:08:58] 6operations, 10ops-eqiad: RMA Samsung EVO ssds - https://phabricator.wikimedia.org/T107326#1492351 (10Cmjohnson) 3NEW a:3Cmjohnson [18:09:37] 6operations, 10Continuous-Integration-Infrastructure: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#1492366 (10Cmjohnson) 5Open>3Resolved Added lanthanum to server spares. 
Resolving this ticket [18:13:01] (03CR) 10Dzahn: [C: 032] "very nice that we can merge this now and it's so much smaller than it used to be" [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [18:15:31] (03CR) 10Dzahn: "i would prefer it if we can break this down into multiple smaller changes" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [18:18:44] (03CR) 10Yuvipanda: [C: 031] "We can probably make this slightly OOP later on, but good to go for now" [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [18:21:41] (03PS2) 10Dzahn: Changed my blog address to new Jekyll from Wordpress [puppet] - 10https://gerrit.wikimedia.org/r/225952 (owner: 1001tonythomas) [18:22:04] (03CR) 10Dzahn: [C: 032] Changed my blog address to new Jekyll from Wordpress [puppet] - 10https://gerrit.wikimedia.org/r/225952 (owner: 1001tonythomas) [18:22:06] mutante: welcome back [18:22:14] YuviPanda: thanks [18:24:48] !log manually attached User:Flow talk page manager accounts [18:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:56] Coren: the patch looks ok. Wanna merge and do a test run? [18:25:47] YuviPanda: Sounds good to me. Also, https://gerrit.wikimedia.org/r/#/c/227710/ could use a quick review? (It's trivial) [18:26:22] (03CR) 10Dzahn: "out of curiosity: is it possible to fix the issue with the SSL cert and use https?" 
[puppet] - 10https://gerrit.wikimedia.org/r/225952 (owner: 1001tonythomas) [18:26:52] (03PS2) 10Yuvipanda: labstore: Add some package requirements for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/227710 (https://phabricator.wikimedia.org/T102478) (owner: 10coren) [18:26:59] (03CR) 10Yuvipanda: [C: 031] labstore: Add some package requirements for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/227710 (https://phabricator.wikimedia.org/T102478) (owner: 10coren) [18:27:09] (03PS5) 10coren: Add cleanup-snapshots script [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) [18:27:56] * Coren patiently waits for jenkins. [18:28:10] (03CR) 10coren: [C: 032] Add cleanup-snapshots script [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [18:28:50] (03PS3) 10coren: labstore: Add some package requirements for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/227710 (https://phabricator.wikimedia.org/T102478) [18:29:45] (03CR) 10coren: [C: 032] labstore: Add some package requirements for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/227710 (https://phabricator.wikimedia.org/T102478) (owner: 10coren) [18:31:00] Coren: :) waiting for jenkins is great. re: the packages, is labstore Debian? i see "python-paramiko" vs. python3-paramiko [18:31:46] mutante: Jessie [18:31:49] Package python3-paramiko: [18:31:49] i 1.15.1-1 stable 500 [18:32:05] cool :) [18:32:39] mutante: labstore is all new python3 now [18:33:00] nice! 
[18:33:04] (03PS1) 10coren: Fix template name for cleanup-snapshots [puppet] - 10https://gerrit.wikimedia.org/r/227761 [18:33:06] ori: nutcracker redis relay is "listen => '127.0.0.1:6380'," [18:33:11] YuviPanda: ^^ silly mistake fix [18:33:23] but redis natively is 6379 [18:33:34] (03CR) 10Yuvipanda: [C: 031] Fix template name for cleanup-snapshots [puppet] - 10https://gerrit.wikimedia.org/r/227761 (owner: 10coren) [18:34:01] (03CR) 10coren: [C: 032] "Trivial fix." [puppet] - 10https://gerrit.wikimedia.org/r/227761 (owner: 10coren) [18:35:51] * Coren headdesks. [18:35:59] chasemp: but tin has a redis instance for trebuchet on 6379 [18:36:17] chasemp: https://gerrit.wikimedia.org/r/#/c/227573/ [18:36:29] !log fixed content models of MediaWiki and Module namespace pages on azbwiki [18:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:16] ori: so tin has redis on box which uses native 6379, you changed nutcracker to listen on 6380, where did you change prod to connect to local 6380? [18:37:27] part is the issue is I guess beta wasn't actually changed over so they are discovering issues [18:37:44] (03PS1) 10coren: Fix typos [puppet] - 10https://gerrit.wikimedia.org/r/227762 [18:37:51] * Coren sucks. ^^ YuviPanda. [18:38:34] (03PS2) 10Yuvipanda: labstore: Fix typos [puppet] - 10https://gerrit.wikimedia.org/r/227762 (owner: 10coren) [18:38:43] (03CR) 10Yuvipanda: [C: 031] labstore: Fix typos [puppet] - 10https://gerrit.wikimedia.org/r/227762 (owner: 10coren) [18:38:52] ah I see [18:38:55] Coren: prefix commit message with module name? [18:39:03] it seems odd to me to have all of prod be the oneoff and not in [18:39:04] tin even [18:39:25] YuviPanda: Ah, yes. Sorry, that was a ridiculously terse commit message borne out of frustration. 
:-) [18:39:38] (03CR) 10coren: [C: 032] labstore: Fix typos [puppet] - 10https://gerrit.wikimedia.org/r/227762 (owner: 10coren) [18:47:11] (03PS1) 1020after4: group1 wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227763 [18:47:36] twentyafterfour: ping [18:56:45] aude: yo [18:57:06] PROBLEM - puppet last run on labstore2001 is CRITICAL Puppet has 1 failures [18:57:59] twentyafterfour: i think we need to run scap for wikidata [18:58:15] because it was on an old version (somehow) from june on wmf16 core [18:58:39] it's on the correct version now but i didn't run scap yet (in case we want to check any other extensions) [18:59:01] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1492488 (10Qgil) I agree that events.wikimedia.org is too generic for this purpose. >>! In T107060#1492264, @CCogdill_WMF wrote: > That said, if you have a suggestion that... [18:59:14] and http://git.wikimedia.org/commit/mediawiki%2Fcore.git/7b3c7d619fab66334abf5d89afdec5dbfff15d9b wasn't on tin earlier today [18:59:20] aude: ok [18:59:34] i think wikidata is the only thing affected afaik [19:00:22] the .gitmodules thing wasn't on tin because tin doesn't even support it properly - it's a newer git feature and tin is old version of git [19:00:28] ok [19:01:07] I'll do scap now [19:01:10] thanks [19:01:34] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227763 (owner: 1020after4) [19:01:39] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227763 (owner: 1020after4) [19:03:25] !log twentyafterfour Started scap: group1 wikis to 1.26wmf16 [19:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:43] (03PS1) 10coren: labstore: fix typo in package name [puppet] - 10https://gerrit.wikimedia.org/r/227767 [19:08:55] YuviPanda: That one's 
yours. :-) ^^ [19:10:15] (03CR) 10Yuvipanda: [C: 031] labstore: fix typo in package name [puppet] - 10https://gerrit.wikimedia.org/r/227767 (owner: 10coren) [19:10:28] (03CR) 10coren: [C: 032] labstore: fix typo in package name [puppet] - 10https://gerrit.wikimedia.org/r/227767 (owner: 10coren) [19:14:48] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1492509 (10jcrespo) [19:17:17] (03PS1) 10Ottomata: Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227768 [19:17:45] (03PS2) 10Ottomata: Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227768 (https://phabricator.wikimedia.org/T106581) [19:23:47] RECOVERY - puppet last run on labstore2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:53] (03CR) 10Rush: "thoughts" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224093 (owner: 10Filippo Giunchedi) [19:26:07] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=58%) [19:26:33] (03PS1) 10Thcipriani: Add redis nutcracker group for beta [puppet] - 10https://gerrit.wikimedia.org/r/227770 (https://phabricator.wikimedia.org/T107288) [19:27:28] (03CR) 10Ori.livneh: [C: 031] "LGTM, but see comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [19:31:36] <_joe_> hey, someone searched me? [19:31:44] <_joe_> Krenair: what was your ping about? 
[19:41:33] (03CR) 10Ori.livneh: [C: 032] Add redis nutcracker group for beta [puppet] - 10https://gerrit.wikimedia.org/r/227770 (https://phabricator.wikimedia.org/T107288) (owner: 10Thcipriani) [19:47:34] 7Puppet, 6operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1492580 (10hashar) [19:48:10] 7Puppet, 6operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#984443 (10hashar) From a recent puppet lint strict run, seems the left over is migrating legacy manifests to puppet modules. [19:48:38] !log twentyafterfour Finished scap: group1 wikis to 1.26wmf16 (duration: 45m 12s) [19:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:39] (03PS1) 10Aude: Bump cache epoche for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227815 [19:51:50] !log scap sync failed on snapshot1001 due to full disk [19:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:12] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1492591 (10hashar) 5stalled>3Resolved a:3hashar Amazing thanks @andrew . The daemon does not run right now because it depends on statsd 2.0 whereas Jessie has 3.0. Will try... [19:53:29] 6operations, 5Continuous-Integration-Isolation: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1492594 (10hashar) labnodepool has been reinstalled from scratch. I might still need root over the next two days to install some new nodepool .deb... 
[19:54:15] 6operations, 5Continuous-Integration-Isolation: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1492598 (10hashar) [19:54:18] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1492596 (10hashar) 5Open>3Resolved Nodepool no more rely on sudo / diskimage-builder [19:56:59] twentyafterfour: not that urgent, but can you (or i) deploy https://gerrit.wikimedia.org/r/#/c/227815/ [19:57:16] to go with changes in wikibase in the new branch [19:58:12] aude: you are listed as an admin for scrumbugz project in labs and phab08 there has been spamming pretty hard from cron [19:58:15] is that vm still in use? [19:58:29] chasemp: you have to ask christopher (or tobi) [19:58:40] is tobi on irc? [19:58:47] i think it can be shut off [19:58:55] i wouldn't remove the instance w/o asking [19:59:10] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [19:59:26] tobi is on irc during european hours, but probably not now [19:59:37] what's his nick? [19:59:43] Tobi_WMDE_SW_NA: [19:59:47] !log restarting restbase1001 to apply logstash config [19:59:48] k thanks I'll try to catch him [19:59:51] ok [19:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150729T2000). Please do the needful. 
[20:00:17] * aude back in ~15 min or so [20:02:12] (03PS1) 10BBlack: enable ipsec on ulsfo text cluster [puppet] - 10https://gerrit.wikimedia.org/r/227867 (https://phabricator.wikimedia.org/T92604) [20:02:23] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1492621 (10CCogdill_WMF) @CaitVirtue and I just checked in and we are willing to change the events.wikimedia.org domain to benefactorevents.wikimedia.org. I am going to upda... [20:03:27] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1492622 (10CCogdill_WMF) [20:04:41] !log bouncing cassandra on restbase1002 to apply logstash config [20:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:02] (03CR) 10BBlack: [C: 032] enable ipsec on ulsfo text cluster [puppet] - 10https://gerrit.wikimedia.org/r/227867 (https://phabricator.wikimedia.org/T92604) (owner: 10BBlack) [20:07:28] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [20:07:29] chasemp: ASFAIK phab08 is dead. there is an instance http://phab09.wmflabs.org/ which is in use by christopher. I think its save to remove phab08 if that does not affect phab09 in an yway [20:07:37] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [20:07:50] Tobi_WMDE_SW_NA: thanks much [20:07:56] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [20:08:10] thx aude for poking :) [20:09:16] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [20:09:17] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [20:11:35] (03PS3) 10John F. 
Lewis: add benefactorevents & eventsdonations CNAMEs for Major Gift [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) [20:11:51] !log bouncing cassandra on restbase1003 to apply logstash config [20:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:14:06] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPSec: roll-out plan - https://phabricator.wikimedia.org/T92604#1492650 (10BBlack) Update: all of the text cluster has ipsec turned on globally now. Holding here until tomorrow in case there's some subtle fallout not yet being observed. [20:15:32] !log bouncing cassandra on restbase1004 to apply logstash config [20:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:47] (03CR) 10Aaron Schulz: [C: 031] varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [20:18:43] !log bouncing cassandra on restbase1005 to apply logstash config [20:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:23] ah, Tobi_WMDE_SW_NA is around :) [20:20:09] aude: sometimes happens I'm sleepless.. 
;) [20:20:10] i'm going to deploy https://gerrit.wikimedia.org/r/#/c/227815/ if no one is deploying [20:20:13] :) [20:20:20] legoktm: can you have another look at https://gerrit.wikimedia.org/r/#/c/187654/ I tested the cache key generation and now it uses the old keys so the risk for a negative performance implication is elimated (beside the regular risk that every change has) [20:22:59] (03CR) 10Dzahn: [C: 031] varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [20:25:14] (03PS7) 10BBlack: varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [20:26:27] !log bouncing cassandra on restbase1006 to apply logstash config [20:26:31] (03CR) 10BBlack: [C: 032] varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [20:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:12] PROBLEM - Parsoid on wtp1004 is CRITICAL - Socket timeout after 10 seconds [20:27:40] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1492678 (10RobH) I should have asked if you wanted your shell login name to be the same as your wikitech name. (Most folks are yes.) I'm preparing your patchset assumin... [20:28:56] (03CR) 10Jforrester: "Yay." [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [20:29:49] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1492681 (10CCogdill_WMF) I appreciate the concerns regarding setting up a new page on http. It has become clear to us that we cannot allow this page to be http for the long-term. However, for this... 
[20:30:32] !log ori Synchronized php-1.26wmf16/extensions/Wikidata/extensions/Wikibase/repo/includes/EditEntity.php: Live-hack stats increment call for session_fail_preview (duration: 00m 12s) [20:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:40] (03PS9) 10BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) [20:31:01] robh: ^ can you see about the parsoid on wtp1004 above [20:31:04] it's acting up [20:31:06] !log ori Synchronized php-1.26wmf16/extensions/Wikidata/extensions/Wikibase/repo/includes/actions/SubmitEntityAction.php: Live-hack stats increment call for session_fail_preview (duration: 00m 12s) [20:31:07] needs a kill [20:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:22] (03CR) 10Ori.livneh: [C: 031] Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [20:31:23] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1492698 (10CCogdill_WMF) p:5Normal>3High [20:31:45] yep [20:31:50] thanks [20:32:06] arlolra: was this expected from updates? [20:32:15] yes [20:32:25] cool, just wanted to ensure i didnt need to pull debug data =] [20:32:35] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: configure less aggressive cassandra log rotation / send cassandra logs to logstash - https://phabricator.wikimedia.org/T100970#1492700 (10Eevans) 5Open>3Resolved All production instances are now reporting to Logstash, and are viewable fr... [20:32:38] whats the restbase process? [20:32:43] robh, looks like 14890 (parsoid) [20:33:24] robh, once in a while, we do get stuck processes (which prevent a clean restart). 
[20:33:53] hrmm, wont lemme normal kill, killing with fire (9) [20:34:13] yes, always the case. hence asking root :) [20:34:25] ok [20:34:29] its gone [20:34:32] thanks [20:34:39] welcome [20:35:32] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1476 bytes in 0.019 second response time [20:37:56] bblack: Thanks! Can't wait to see it live :) [20:38:19] database locked for maintenance? [20:39:50] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1492718 (10BBlack) >>! In T107059#1492681, @CCogdill_WMF wrote: > XP is still a common OS in large corporations as well as for older generations Ignoring the rest of the current discussion and ju... [20:41:07] (03PS2) 10Aude: Bump cache epoche for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227815 [20:41:33] !log manually fixed content models for wikidata's Module namespace (T107340) [20:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:21] !log updated Parsoid to version 6e095a92 [20:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:42] !log update page set page_content_model="Scribunto" where page_id=12134769; on wikidatawiki [20:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:40] <_joe_> catchpoint paged us with a restbase failure [20:46:49] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1492740 (10dr0ptp4kt) Mozilla marked this as resolved. Thanks @bblack! 
[20:47:02] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1492742 (10CCogdill_WMF) > Ignoring the rest of the current discussion and just going little deeper on this one point: Microsoft ended all support for Windows XP a little over a year ago ( https:/... [20:47:28] looking at restbase [20:48:22] PROBLEM - puppet last run on ms-be2006 is CRITICAL puppet fail [20:48:27] <_joe_> the page loads fine for me btw [20:48:57] there was a spike in connection timeouts on 1001 [20:49:00] me too [20:49:08] http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad [20:50:31] OutboundTcpConnection.java [20:50:32] :313 - error writing to /10.64.48.100 [20:50:36] (03PS1) 10RobH: adding new employee Mikhail Popov to discovery hosts [puppet] - 10https://gerrit.wikimedia.org/r/227877 [20:52:16] it looks like C* was restarted on 1001 fairly recently [20:52:49] ebernhardson: are you aware of the spike in error messages on the CirrusSearchChangeFailed.log on fluorine? [20:53:16] _joe_, godog: that might have been urandom applying logstash config changes [20:53:21] see earlier log entries [20:53:51] ori: nope, thanks for mentioning it [20:54:29] (03CR) 10RobH: [C: 032] adding new employee Mikhail Popov to discovery hosts [puppet] - 10https://gerrit.wikimedia.org/r/227877 (owner: 10RobH) [20:55:00] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1492761 (10RobH) Chatted with both @ironholds and @mpopov in irc. So we're replicating all of @ironhold's permissions, as @mpopov will have to run similar commands and r... 
[20:55:15] gwicke: could be, last restart was 20:26 [20:55:45] yeah, the time the timeouts on 1001 started [20:56:09] fixed itself it seems [20:56:28] http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad?panelId=22&fullscreen [20:57:45] 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1492774 (10Spage) #ArchCom basically approves, details will be worked out in gerrit review. @tstarling will wri... [20:58:14] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1492775 (10BBlack) >>! In T107059#1492742, @CCogdill_WMF wrote: > Per http://www.netmarketshare.com/, this quarter XP has the second highest market share of all OS versions. We can't assume this w... [20:59:51] gwicke: yeah I was looking at why, can't find a smoking gun so far [21:05:51] 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1492816 (10Spage) #ArchCom basically approves, details will come from gerrit review. @tstarling will write down the recommended approach to... [21:08:07] (03PS22) 10Gergő Tisza: [WIP] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [21:08:29] (03PS1) 10John F. Lewis: add wmf-officeit group to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227879 (https://phabricator.wikimedia.org/T106724) [21:13:13] 6operations, 10Incident-20150205-SiteOutage, 7Availability: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1492846 (10ori) It's because we don't set `server_retry_timeout`. Per Nutcracker's [recommendation](https://github.com/twitter/... 
[21:14:17] (03Abandoned) 10Dzahn: add base::firewall on codfw redis nodes [puppet] - 10https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [21:14:19] robh, wait, so... statistics-admins group needs ops meeting review, restricted doesn't? [21:14:39] i didnt think restricted let them sudo... if i did i messed up and get to revert. [21:14:56] oh, it sure does [21:14:58] right in the damn thing [21:15:00] sure, it lets you sudo as www-data and apache [21:15:00] lemme fix. [21:15:09] it used to let you do even more [21:15:16] thanks for catching [21:15:18] when that was just granted to wikidev or whatever [21:15:21] RECOVERY - puppet last run on ms-be2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:16:30] bblack: OK, something went wrong [21:16:31] https://test.wikipedia.org/Special:RecordImpression?&campaign=wm2015register [21:16:39] bleeeeeehhh bad rob. [21:16:39] What are those characters [21:16:39] �77777702�77777640 [21:16:41] (03PS1) 10RobH: accidental sudo escalation for bearloga [puppet] - 10https://gerrit.wikimedia.org/r/227880 [21:17:05] (03CR) 10RobH: [C: 032] accidental sudo escalation for bearloga [puppet] - 10https://gerrit.wikimedia.org/r/227880 (owner: 10RobH) [21:17:40] Oh, interesting. I put a non-breaking space there to avoid widow words [21:17:46] I guess vcp is not UTF-8? [21:17:49] cvl [21:17:50] vcl [21:17:52] or erb [21:17:53] or something [21:18:42] yeah probably not :) [21:19:08] can we do them as html character entities or whatever they're called? 
[21:19:38] (03PS1) 10Ori.livneh: nutcracker: prevent servers from being marked as dead indefinitely [puppet] - 10https://gerrit.wikimedia.org/r/227881 [21:20:21] (03PS2) 10Ori.livneh: nutcracker: prevent servers from being marked as dead indefinitely [puppet] - 10https://gerrit.wikimedia.org/r/227881 (https://phabricator.wikimedia.org/T88730) [21:20:39] (03PS3) 10Ori.livneh: nutcracker: prevent servers from being marked as dead indefinitely [puppet] - 10https://gerrit.wikimedia.org/r/227881 (https://phabricator.wikimedia.org/T88730) [21:20:53] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1492873 (10RobH) @krinkle caught my mistake. restricted is clearly a sudo group, so it also has to get reviewed during the ops meeting. https://gerrit.wikimedia.org/r/#... [21:21:07] (03CR) 10Ori.livneh: [C: 032 V: 032] nutcracker: prevent servers from being marked as dead indefinitely [puppet] - 10https://gerrit.wikimedia.org/r/227881 (https://phabricator.wikimedia.org/T88730) (owner: 10Ori.livneh) [21:21:24] robh, I'm not Krinkle :p [21:21:42] yes.. im stupid. [21:21:49] i thought my other mistake made that clear [21:22:04] !log fixed Module:*/doc pages on wikidatawiki [21:22:10] fixed ;D [21:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:12] :D [21:22:37] today is just one of those days; its almost over! [21:23:54] (03PS1) 10BBlack: varnish default error page: Convert literal utf8 nbsp to html entities [puppet] - 10https://gerrit.wikimedia.org/r/227882 [21:24:02] robh, I know the feeling... [21:24:24] Krinkle: ^ https://gerrit.wikimedia.org/r/227882 ? [21:26:22] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [21:28:04] heya robh, do you know how or why madhuvishy can't log into graphite.wikimedia.org? 
[21:28:12] i'm trying to check her ldap groups, but i'm not really sure how [21:28:19] this doesn't seem to work for me [21:28:19] https://wikitech.wikimedia.org/wiki/Ldapsearch [21:29:49] ive not messed with ldap outside of pulling uids for account setups [21:29:56] but happy to find out [21:30:11] (03CR) 10Krinkle: [C: 031] varnish default error page: Convert literal utf8 nbsp to html entities [puppet] - 10https://gerrit.wikimedia.org/r/227882 (owner: 10BBlack) [21:30:17] bblack: Yep, looks right. [21:30:18] robh, according to https://wikitech.wikimedia.org/wiki/Graphite.wikimedia.org [21:30:25] what username is she logging in with? [21:30:28] all she needs is to be in the ldap/wmf group [21:30:33] madhuvishy: ^^? [21:30:50] Krenair: Madhuvishy [21:31:10] same as wikitech credentials? [21:31:15] Krenair: yup [21:31:29] yeah you're not in the group [21:31:31] (03CR) 10BBlack: [C: 032] varnish default error page: Convert literal utf8 nbsp to html entities [puppet] - 10https://gerrit.wikimedia.org/r/227882 (owner: 10BBlack) [21:31:59] ottomata, to check the list of people in ldap/wmf, log into a labs host (or terbium, IIRC) and run "ldaplist -l group wmf" [21:32:05] bblack: Let's also strip the leading line break before the hm ,welp, madhuvishy is a wmf staffer [21:32:28] Sorry for putting that there [21:32:30] can I just add her? [21:32:49] ostriches ^ [21:32:56] Probably? [21:33:43] ottomata: yes, you can just add, if they are an actual employee [21:33:53] part of the employment paperwork is the nda [21:33:57] (03PS1) 10BBlack: varnish default error page: no linebreak before DOCTYPE [puppet] - 10https://gerrit.wikimedia.org/r/227883 [21:33:57] i've had to ask that in the past =] [21:34:12] nda confirmation only becomes really painful when its volunteers or contractors [21:34:24] ottomata: unless they are a contractor, then i dunno. [21:34:49] modify-ldap-group wmf --addmembers madhuvishy" [21:34:53] should do it, I think? 
[21:34:56] robh: :) I'm full time at the office, so should be okay i guess [21:35:02] aye, Krenair thanks, wikitech search actually was relevant this time! [21:35:07] found this [21:35:07] https://wikitech.wikimedia.org/wiki/Add-labs-user [21:35:08] :) [21:35:12] oh awesome :/ [21:35:14] ?[1;31mError: /Stage[main]/Varnish::Common::Vcl/File[/etc/varnish/errorpage.inc.vcl]/content: change from {md5}168caa07914a0a6a4a60d9bd561deeb3 to {md5}6f88e03bb7206cd48865a4a3295ec259 failed: invalid byte sequence in US-ASCII?[0m [21:35:26] !:) [21:35:27] it didn't complain on the initial push, but now it complains when trying to remove them :P [21:35:48] cool, madhuvishy try logging in now [21:36:02] ohh, that link is indeed awesome [21:36:03] looking into it [21:36:16] ottomata: yay works. Thanks robh and Krenair [21:36:16] ottomata: this is all on terbium right? [21:37:13] yes [21:37:13] PROBLEM - puppet last run on cp1055 is CRITICAL Puppet has 1 failures [21:37:39] we're about to get a lot of those ^ [21:37:42] PROBLEM - puppet last run on cp1054 is CRITICAL Puppet has 1 failures [21:37:43] PROBLEM - puppet last run on cp1066 is CRITICAL Puppet has 1 failures [21:37:43] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 1 failures [21:37:44] ignore it, already being dealt with [21:37:52] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 1 failures [21:38:11] PROBLEM - puppet last run on cp1052 is CRITICAL Puppet has 1 failures [21:40:27] (03CR) 10BBlack: [C: 032] varnish default error page: no linebreak before DOCTYPE [puppet] - 10https://gerrit.wikimedia.org/r/227883 (owner: 10BBlack) [21:43:42] RECOVERY - puppet last run on cp1054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:44:12] RECOVERY - puppet last run on cp1052 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:45:22] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [21:46:02] 
RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:18] yay for less spam than expected [21:47:32] PROBLEM - puppet last run on cp2021 is CRITICAL Puppet has 2 failures [21:47:51] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:47:51] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:48:50] (03PS1) 10coren: nagios_common: add new checks for systemd unit health [puppet] - 10https://gerrit.wikimedia.org/r/227887 [21:48:59] YuviPanda: ^^ for sanity checking [21:49:41] RECOVERY - puppet last run on cp2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:52:21] Coren: looking [21:53:10] (03CR) 10Yuvipanda: [C: 04-1] "the check_command definitions should be under modules/nagios_common/files/check_commands/$command_name.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/227887 (owner: 10coren) [21:53:46] (03CR) 10Yuvipanda: nagios_common: add new checks for systemd unit health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227887 (owner: 10coren) [21:53:52] Coren: outside these two commends looks ok [21:54:02] YuviPanda: Hm? I based myself on existing checks. Change in convention? [21:54:41] (03CR) 10Dzahn: [C: 04-1] "role::syslog::centralserver does not have any ferm rules. it uses misc::syslog-server which also does not seem to have any. it seems this " [puppet] - 10https://gerrit.wikimedia.org/r/227697 (owner: 10Muehlenhoff) [21:55:22] Coren: so check_commands.cfg is a massive clusterfuck - it's copied off some ancient version of nagios + lots of modifications. When I converted stuff to nagios_common I split off as much as I couldn't but couldn't split everything... 
[21:55:55] Coren: also the check_command define defaults to looking at that path for the config files [21:56:01] PROBLEM - puppet last run on mw2133 is CRITICAL Puppet has 1 failures [21:56:16] (if you look at the docs for the define it'll be clearer) [21:56:52] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [21:57:07] I'm not sure what could be documented except "unit_status_log_starting_stopping_reloading" is the value of CODE_FUNCTION in log entries for when the unit is started, stopped or reloaded as observed." :-) [21:57:51] ok [21:58:40] Ah, I note a minor error. I take the time to --reverse the log order but forgot to break once I found the useful timestamp. [22:02:31] (03CR) 10Dzahn: "kind of unrelated note, but the " file { '/a':" thing should really be in a role and not on a node directly" [puppet] - 10https://gerrit.wikimedia.org/r/227417 (owner: 10Muehlenhoff) [22:03:55] (03PS2) 10coren: nagios_common: add new checks for systemd unit health [puppet] - 10https://gerrit.wikimedia.org/r/227887 [22:04:10] (03CR) 10coren: nagios_common: add new checks for systemd unit health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227887 (owner: 10coren) [22:06:37] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1492983 (10CCogdill_WMF) > Keeping in mind we already killed IE6 access some time ago, IE7/8-on-XP (the ones with horrible TLS security and no SNI support) at our TLS terminations still only accou... [22:07:22] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:11:18] eh, how did we fix: [22:11:21] ssl.PROTOCOL_SSLv3: OpenSSL.SSL.SSLv3_METHOD, [22:11:22] AttributeError: 'module' object has no attribute 'PROTOCOL_SSLv3' [22:11:26] with git review [22:11:39] it's python requests [22:12:10] issues? [22:12:12] with python requests? 
[22:12:27] we don't want to use SSLv3 [22:12:27] legoktm [22:12:32] but what was the fix to make it use TLS [22:12:42] umm [22:12:52] don't use git review? ;) [22:12:55] lol [22:12:57] grrr :) [22:13:09] mutante: exactly, https://github.com/legoktm/grr [22:13:17] you'll tell me to just update using pip, right [22:13:22] but i wanted distro packages :) [22:13:35] legoktm: hehehe [22:13:43] the version packaged for fedora 22 works just fine :) [22:13:57] ii git-review 1.24-2 [22:14:10] jessie [22:14:29] fedora is 1.24-5.fc22 [22:14:33] works for me [22:14:46] Why do you want to use a distro package? :) [22:14:49] ii python-requests 2.4.3-6 [22:15:01] (03PS1) 10BryanDavis: Ignore debug level messages from the 'redis' logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227892 [22:15:30] goes to bug Debian [22:15:57] dnf info python-requests -> 2.7.001.fc22 [22:17:26] i guess i need to upgrade to sid :) [22:19:32] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1493043 (10BBlack) >>! In T107059#1492983, @CCogdill_WMF wrote: > Our major donors are not our normal users, and if I'm doing my math right, 0.6% of traffic still comes out to ~2.7 million users.... 
[22:20:52] RECOVERY - puppet last run on mw2133 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:25:11] (03CR) 10Yuvipanda: [C: 031] nagios_common: add new checks for systemd unit health [puppet] - 10https://gerrit.wikimedia.org/r/227887 (owner: 10coren) [22:27:05] !log update page set page_content_model ="wikitext" where page_id=12134769; on wikidatawiki [22:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:27:51] AaronSchulz, I did a mess on gerrit:227878, hope you can understand what I am trying to say [22:37:51] (03PS1) 10Gergő Tisza: Version update: 7.6.2 -> 7.7.0 [software/sentry] - 10https://gerrit.wikimedia.org/r/227899 [22:41:16] (03PS1) 10Ori.livneh: Set nutcracker log verbosity to LOG_INFO, per deployment recommendations [puppet] - 10https://gerrit.wikimedia.org/r/227902 [22:48:50] (03PS2) 10Ori.livneh: Set nutcracker log verbosity to LOG_INFO, per deployment recommendations [puppet] - 10https://gerrit.wikimedia.org/r/227902 [22:51:28] (03CR) 10Ori.livneh: [C: 032] Set nutcracker log verbosity to LOG_INFO, per deployment recommendations [puppet] - 10https://gerrit.wikimedia.org/r/227902 (owner: 10Ori.livneh) [22:51:37] (03PS3) 10Dzahn: Add ferm rules for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/227711 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [22:59:22] PROBLEM - HHVM rendering on mw1220 is CRITICAL - Socket timeout after 10 seconds [23:00:05] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150729T2300). Please do the needful. [23:00:05] JohnLewis jamesofur bd808: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:10] hey [23:00:24] Hey [23:00:31] o/ [23:00:43] PROBLEM - nutcracker port on mw1220 is CRITICAL: Connection refused [23:00:48] hey [23:00:54] PROBLEM - nutcracker process on mw1220 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [23:01:11] (03CR) 10Alex Monk: [C: 032] add wmf-officeit group to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227879 (https://phabricator.wikimedia.org/T106724) (owner: 10John F. Lewis) [23:01:17] (03Merged) 10jenkins-bot: add wmf-officeit group to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227879 (https://phabricator.wikimedia.org/T106724) (owner: 10John F. Lewis) [23:01:34] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 62914 bytes in 0.543 second response time [23:02:03] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227879/ (duration: 00m 12s) [23:02:04] uh oh [23:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:32] !log snapshot1001 - No space left on device [23:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:25] !log snapshot1001 - apt-get clean - 107M avail [23:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:44] JohnFLewis, Jamesofur: done :) [23:03:50] thank ye much [23:03:50] Krenair: on the bright side, sync works and confirmed on meta :) [23:04:20] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1493280 (10mobrovac) [23:04:23] PROBLEM - Apache HTTP on mw1220 is CRITICAL - Socket timeout after 10 seconds [23:04:23] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1493281 (10mobrovac) [23:05:01] PROBLEM - HHVM busy threads on 
mw1220 is CRITICAL 57.14% of data above the critical threshold [115.2] [23:05:29] mutante, sync-common on snapshot1001 failed [23:05:51] PROBLEM - HHVM rendering on mw1220 is CRITICAL - Socket timeout after 10 seconds [23:05:58] still no space [23:06:14] is mw1220 related to the sync? [23:06:22] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.313 second response time [23:06:29] looking what else to delete on snapshot [23:07:03] ah... there was a second error in that sync [23:07:31] for mw1010 [23:07:50] removing old kernel packages [23:07:55] we probably need to prune some old l10n files to help snapshot1001 out [23:08:05] something's wrong [23:08:27] some very weird error messages going through hhvm.log [23:08:59] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor looks a mess [23:09:05] bd808, yeah, it actually failed at /srv/mediawiki/php-1.26wmf16/cache/l10n/upstream/l10n_cache-nso.cdb.json [23:09:21] basically the entire space is used by /srv/mediawiki [23:09:25] it's mostly slow queries for MIMEsearchpage at the moment [23:09:35] i could free like another 200M [23:09:48] a bunch of "entire web request took longer than 290 seconds and timed out in" from random points in the code [23:10:12] we still have l10n for wmf12, 13, and 14 on the cluster [23:10:21] do we need to have all versions from php-1.26wmf12 through wmf16 ? [23:10:41] mutante: yes but we can free a lot of space by dumpting l10n [23:10:42] PROBLEM - Apache HTTP on mw1220 is CRITICAL - Socket timeout after 10 seconds [23:10:44] 5.0G php-1.26wmf13 [23:11:08] twentyafterfour: are you around? want to prune stale l10n? 
[23:11:16] wmf12 would've started getting deployed on the 30th of june and then should've ceased being used on the 9th of july [23:11:45] 12 is just 3GB, 13 is 5GB, 16 is 4GB [23:11:53] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 62907 bytes in 0.572 second response time [23:11:54] surprisingly different [23:12:01] (03PS1) 10Andrew Bogott: Drop our cache ttls WAY down [puppet] - 10https://gerrit.wikimedia.org/r/227905 (https://phabricator.wikimedia.org/T107325) [23:12:26] ori was doing some l10n cache change experiments just before wikimania [23:12:35] that would have taken more space [23:12:58] Krenair: I'll purge some l10n [23:13:00] ok [23:13:01] no LVM :p [23:13:13] snapshoot1001 needs a reimage [23:13:15] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Purge_localization_cache_for_now_unused_versions [23:13:16] as trusty [23:13:40] (03PS2) 10Andrew Bogott: Drop our cache ttls WAY down [puppet] - 10https://gerrit.wikimedia.org/r/227905 (https://phabricator.wikimedia.org/T107325) [23:13:48] !log bd808 Purged l10n cache for 1.26wmf12 [23:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:19] !log bd808 Purged l10n cache for 1.26wmf13 [23:14:20] how about php-1.26wmf9 [23:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:40] !log bd808 Purged l10n cache for 1.26wmf14 [23:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:49] mutante: how does df look now? 
[23:14:51] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [23:14:54] and we have almost 10G free :) [23:14:56] thanks bd808 [23:15:06] Use: 63% [23:15:18] l10n is huge :/ [23:15:21] (03CR) 10Andrew Bogott: [V: 032] Drop our cache ttls WAY down [puppet] - 10https://gerrit.wikimedia.org/r/227905 (https://phabricator.wikimedia.org/T107325) (owner: 10Andrew Bogott) [23:15:39] json + cdb + another json copy [23:15:51] (03CR) 10Andrew Bogott: [C: 032] Drop our cache ttls WAY down [puppet] - 10https://gerrit.wikimedia.org/r/227905 (https://phabricator.wikimedia.org/T107325) (owner: 10Andrew Bogott) [23:15:57] the purge gets rid of the second json and the cdb [23:16:06] (03CR) 10Andrew Bogott: "bah, very inaccurate mouse click!" [puppet] - 10https://gerrit.wikimedia.org/r/227905 (https://phabricator.wikimedia.org/T107325) (owner: 10Andrew Bogott) [23:16:41] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.101 second response time [23:17:51] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 35.71% of data above the critical threshold [500.0] [23:20:10] !log starting script to fix Scribunto content models due to imports on all wikis (T91170) [23:20:12] PROBLEM - HHVM rendering on mw1220 is CRITICAL - Socket timeout after 10 seconds [23:20:12] (03CR) 10Dzahn: [C: 032] "ports confirmed on dataset1001" [puppet] - 10https://gerrit.wikimedia.org/r/227711 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [23:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:55] mutante, Krenair: we could probably safely drop wmf9 and 10. 11 is on the cusp that I try to stay away from. The cleanup is a bit yucky and takes a full scap though + some gerrit patches [23:21:17] twentyafterfour should really be dropping the oldest branch every week with the train [23:21:41] we have time for a full scap [23:22:00] let's do that config patch of yours first though? 
[23:22:21] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 62907 bytes in 9.921 second response time [23:22:34] it should be easy peasy. [23:22:39] the config patch [23:23:03] (03CR) 10Alex Monk: [C: 032] Ignore debug level messages from the 'redis' logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227892 (owner: 10BryanDavis) [23:23:09] (03Merged) 10jenkins-bot: Ignore debug level messages from the 'redis' logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227892 (owner: 10BryanDavis) [23:23:51] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227892/ (duration: 00m 12s) [23:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:00] bd808: since we have 10G now and 9 and 10 are just like 500M it might not be worth it [23:25:01] PROBLEM - Apache HTTP on mw1220 is CRITICAL - Socket timeout after 10 seconds [23:25:21] RECOVERY - nutcracker port on mw1220 is OK: TCP OK - 0.000 second response time on port 11212 [23:25:26] I'd personally rather not do it tonight [23:25:32] RECOVERY - nutcracker process on mw1220 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:25:47] * bd808 is getting tired of staring at a computer and needs food [23:26:33] bd808: delayed dropping.... remember when we lost all messages in the new version for 45 minutes due to debugging by rm'ing them? 
:) [23:26:35] (03PS4) 10Dzahn: Add ferm rules for dataset NFS server [puppet] - 10https://gerrit.wikimedia.org/r/227711 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [23:26:52] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [23:26:57] good times [23:28:13] ori: redis/nutcracker on mw1200 and mw1220 look sad in /a/mw-log/redis.log [23:28:57] oh, there's another one I've been meaning to do [23:29:03] (03CR) 10Alex Monk: [C: 032] Disable a bunch of extensions on loginwiki/votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [23:29:28] (03Merged) 10jenkins-bot: Disable a bunch of extensions on loginwiki/votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [23:30:12] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:30:18] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/225840/ (duration: 00m 12s) [23:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:51] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/225840/ (duration: 00m 12s) [23:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:30] ori: mw1220 is definately redis sad -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/redis [23:31:48] 8644 log events in the last hour from that one [23:33:52] RECOVERY - HHVM busy threads on mw1220 is OK Less than 30.00% above the threshold [76.8] [23:34:32] Is anybody running a maintenance job right now that would be pounding the dbs? 
There are a lot more fcgi timeouts and mysql slow query alerts in the hhvm logs than normal [23:34:55] legoktm: ^ [23:35:07] bd808: me probably [23:35:11] like several orders of magnitude more [23:35:23] 23:20 < legoktm> !log starting script to fix Scribunto content models due to imports on all wikis (T91170) [23:36:24] it's nearly done, should be a minute [23:37:05] but I should be mostly hitting slaves [23:37:09] the badness looks like it really started around 05:45 today [23:37:14] way before legoktm [23:37:21] are the slow query alerts all from mimesearch? [23:37:54] because I saw a task for that [23:38:04] https://phabricator.wikimedia.org/T107265 [23:38:39] should we just make the query page cached? [23:38:43] legoktm: my bad, sorry to blame you :) [23:38:54] nothing in SAL that matches the time frame [23:39:06] no worries :) [23:41:00] Krenair: I think yes, the slow queries are mostly the horrible MIMEsearchPage query [23:41:31] so probably not a big deal except for the log noise that it makes drowning out everything else [23:43:14] !log finished fixing Scribunto content models [23:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:29] grep -v :) [23:45:45] 6operations, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1493496 (10bd808) Searching on https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm shows slow timer alerts for `MIMEsearchPage::reallyDoQue... [23:54:22] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [23:56:37] what's going on?