[00:00:18] (CR) Dzahn: [C: 2] tendril: add config template [puppet] - https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816) (owner: Dzahn)
[00:01:01] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1451044 (BBlack) >>! In T102566#1450297, @demon wrote: >>>! In T102566#1449713, @BBlack wrote: >> So, where are we at on re...
[00:03:01] (CR) Springle: tendril: add config template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816) (owner: Dzahn)
[00:10:38] (CR) BryanDavis: Don't assume current l10n cache files are .cdb (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/224520 (owner: Ori.livneh)
[00:11:31] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1451066 (BBlack)
[00:11:34] (CR) Ori.livneh: Don't assume current l10n cache files are .cdb (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/224520 (owner: Ori.livneh)
[00:12:33] (PS1) Dzahn: tendril: use tendril-backend CNAME as db_host [puppet] - https://gerrit.wikimedia.org/r/224547 (https://bugzilla.wikimedia.org/98816)
[00:12:45] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (BBlack)
[00:12:48] (PS2) Dzahn: tendril: use tendril-backend CNAME as db_host [puppet] - https://gerrit.wikimedia.org/r/224547 (https://phabricator.wikimedia.org/T98816)
[00:13:41] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1451071 (BBlack)
[00:13:49] (PS3) Dzahn: tendril: use tendril-backend CNAME as db_host [puppet] - https://gerrit.wikimedia.org/r/224547
(https://bugzilla.wikimedia.org/98816)
[00:14:19] (CR) Dzahn: [C: 2] tendril: use tendril-backend CNAME as db_host [puppet] - https://gerrit.wikimedia.org/r/224547 (https://bugzilla.wikimedia.org/98816) (owner: Dzahn)
[00:14:28] (CR) Dzahn: [V: 2] tendril: use tendril-backend CNAME as db_host [puppet] - https://gerrit.wikimedia.org/r/224547 (https://bugzilla.wikimedia.org/98816) (owner: Dzahn)
[00:20:44] (PS1) Manybubbles: Add es-tool upgrade-fast and stopping paranoia [puppet] - https://gerrit.wikimedia.org/r/224548
[00:21:20] (PS2) Manybubbles: Add es-tool upgrade-fast and stopping paranoia [puppet] - https://gerrit.wikimedia.org/r/224548
[00:28:36] (PS4) Ori.livneh: Don't assume current l10n cache files are .cdb [tools/scap] - https://gerrit.wikimedia.org/r/224520
[00:28:38] bd808: ^
[00:33:22] bd808: blech, the commit message is no longer accurate. I'll update that.
[00:33:55] ori: cool. I'm testing the code locally now just to feel like I did something. :)
[00:37:30] (PS1) Dzahn: tendril: fix Apache config Options [puppet] - https://gerrit.wikimedia.org/r/224549 (https://phabricator.wikimedia.org/T98816)
[00:37:32] (PS5) Ori.livneh: Don't assume current l10n cache files are .cdb [tools/scap] - https://gerrit.wikimedia.org/r/224520
[00:38:33] (PS2) Dzahn: tendril: fix Apache config Options [puppet] - https://gerrit.wikimedia.org/r/224549 (https://phabricator.wikimedia.org/T98816)
[00:39:00] (CR) BryanDavis: [C: 2] Don't assume current l10n cache files are .cdb [tools/scap] - https://gerrit.wikimedia.org/r/224520 (owner: Ori.livneh)
[00:39:04] (CR) Dzahn: [C: 2] tendril: fix Apache config Options [puppet] - https://gerrit.wikimedia.org/r/224549 (https://phabricator.wikimedia.org/T98816) (owner: Dzahn)
[00:39:09] bd808: woot, thanks!
[00:39:23] (Merged) jenkins-bot: Don't assume current l10n cache files are .cdb [tools/scap] - https://gerrit.wikimedia.org/r/224520 (owner: Ori.livneh)
[00:41:00] operations, Database, Patch-For-Review: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1451095 (Dzahn) done. - config.php now generated by puppet - .htaccess replaced by lines in main Apache template
[00:41:30] operations, Database: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1451102 (Dzahn)
[00:41:45] (PS1) BBlack: add ecc-star.wmfusercontent.org.crt [puppet] - https://gerrit.wikimedia.org/r/224551
[00:41:47] (PS1) BBlack: switch wmfusercontent.org to ECDSA+RSA [puppet] - https://gerrit.wikimedia.org/r/224552
[00:42:08] (CR) BBlack: [C: 2 V: 2] add ecc-star.wmfusercontent.org.crt [puppet] - https://gerrit.wikimedia.org/r/224551 (owner: BBlack)
[00:43:12] (CR) BBlack: [C: 2 V: 2] switch wmfusercontent.org to ECDSA+RSA [puppet] - https://gerrit.wikimedia.org/r/224552 (owner: BBlack)
[00:46:33] operations, Database: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1451105 (Dzahn) Open>Resolved should be fully automatic now
[00:49:54] (PS1) Dzahn: move grafana from zirconium to netmon1001 [puppet] - https://gerrit.wikimedia.org/r/224554
[00:51:08] (PS2) Dzahn: move grafana from zirconium to netmon1001 [puppet] - https://gerrit.wikimedia.org/r/224554
[00:54:19] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1451119 (BBlack)
[00:55:29] (PS4) Dzahn: Add a note about RCStream to irc.wikimedia.org MOTD [puppet] - https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780) (owner: Glaisher)
[00:57:26] (CR) Dzahn: [C: 2] "i won't restart the service at this time - but we will soon
enough" [puppet] - https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780) (owner: Glaisher)
[00:57:58] (PS2) Dzahn: site.pp - add comments about server roles [puppet] - https://gerrit.wikimedia.org/r/224191
[00:58:18] (PS6) Dzahn: ferm rules for IRCd [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943)
[00:59:18] (PS3) Dzahn: move grafana from zirconium to netmon1001 [puppet] - https://gerrit.wikimedia.org/r/224554 (https://phabricator.wikimedia.org/T105008)
[01:12:57] (CR) Dzahn: [C: 2] "noop for now, and former PS already had reviews" [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[01:18:18] (PS1) Springle: repool db1037; depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/224555
[01:19:11] (PS4) Dzahn: Reclaim lanthanum: remove related puppet conf [puppet] - https://gerrit.wikimedia.org/r/223175 (https://phabricator.wikimedia.org/T86658) (owner: Hashar)
[01:20:12] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1367811 (demon) >>! In T102566#1451044, @BBlack wrote: > Just to clear up confusion with the @Tgr's comment as well: Are we...
[01:20:33] (CR) Springle: [C: 2] repool db1037; depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/224555 (owner: Springle)
[01:21:28] MediaWiki-General-or-Unknown: MWHttpRequest's redirect behavior is terrible - https://phabricator.wikimedia.org/T105765#1451141 (demon) NEW
[01:21:30] bblack: Heh ^
[01:22:09] !log springle Synchronized wmf-config/db-eqiad.php: repool db1037; depool db1030 (duration: 00m 13s)
[01:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:37:20] (PS3) Dzahn: site.pp - add comments about server roles [puppet] - https://gerrit.wikimedia.org/r/224191
[01:38:37] (CR) John F. Lewis: [C: 1] "(line 1623 - people need to appreciate the awesomeness of that comment)" [puppet] - https://gerrit.wikimedia.org/r/224191 (owner: Dzahn)
[01:39:23] (CR) Dzahn: [C: 2] site.pp - add comments about server roles [puppet] - https://gerrit.wikimedia.org/r/224191 (owner: Dzahn)
[01:49:41] (CR) Dzahn: [C: 1] Split labs-specific bits of base into labs::base [puppet] - https://gerrit.wikimedia.org/r/33066 (owner: Faidon Liambotis)
[01:52:04] (CR) Dzahn: [C: 1] grafana: Set a default dashboard [puppet] - https://gerrit.wikimedia.org/r/224129 (owner: Krinkle)
[01:56:09] (PS1) BBlack: HTTPS redirects: Remove meta+MediaWiki exception [puppet] - https://gerrit.wikimedia.org/r/224556
[01:56:11] (PS1) BBlack: HTTPS redirects: remove InstantCommons exception [puppet] - https://gerrit.wikimedia.org/r/224557 (https://phabricator.wikimedia.org/T102566)
[01:58:30] (CR) Dzahn: [C: -1] "there is a followed by and another (" Stray end tag body")" [puppet] - https://gerrit.wikimedia.org/r/223012 (owner: Krinkle)
[01:58:36] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1451187 (BBlack) Change prepped so it's easy.
I'm open to debate on timing (I tend to think we should least have a softwar...
[02:02:34] !log LocalisationUpdate failed (1.26wmf13) at 2015-07-14 02:02:33+00:00
[02:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:03:25] operations, Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1451189 (Dzahn) @ori can it move to netmon1001 maybe?
[02:05:04] jzerebecki: i dont know about that list, sorry
[02:05:29] (PS1) John F. Lewis: remove db100[2-7] from install_server and coredb [puppet] - https://gerrit.wikimedia.org/r/224558
[02:05:41] mutante / springle ^^
[02:07:32] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 14 02:07:32 UTC 2015 (duration 7m 30s)
[02:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:14:35] (PS1) John F. Lewis: remove db100[2-7]{.mgmt}.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/224560
[02:14:40] mutante / springle ^
[02:19:16] operations, ops-eqiad, Database: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1451207 (JohnLewis) NEW
[02:19:29] (PS2) John F. Lewis: remove db100[2-7]{.mgmt}.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/224560 (https://phabricator.wikimedia.org/T105768)
[02:19:39] (PS2) John F.
Lewis: remove db100[2-7] from install_server and coredb [puppet] - https://gerrit.wikimedia.org/r/224558 (https://phabricator.wikimedia.org/T105768)
[02:31:36] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 07m 27s)
[02:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:21] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-14 02:35:21+00:00
[02:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:45:45] (PS2) Yuvipanda: remove include ::diamond [puppet] - https://gerrit.wikimedia.org/r/224507 (owner: 20after4)
[02:45:50] (CR) Yuvipanda: [C: 2 V: 2] remove include ::diamond [puppet] - https://gerrit.wikimedia.org/r/224507 (owner: 20after4)
[03:31:44] operations, SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#1451237 (Stu) @dr0ptp4kt I still don't have access AFAIK to any of the https site variants. haven't been able to do any analysis on a bunch of the bug fixes we've deployed and are working on. would appreciate it ASAP. cc @wwes
[04:25:11] (PS1) Ori.livneh: Don't exclude PHP files from being synced [tools/scap] - https://gerrit.wikimedia.org/r/224561
[04:25:39] (CR) Ori.livneh: "Simple; line removed was added by previous patch." [tools/scap] - https://gerrit.wikimedia.org/r/224561 (owner: Ori.livneh)
[04:25:45] (CR) Ori.livneh: [C: 2] Don't exclude PHP files from being synced [tools/scap] - https://gerrit.wikimedia.org/r/224561 (owner: Ori.livneh)
[04:26:05] (Merged) jenkins-bot: Don't exclude PHP files from being synced [tools/scap] - https://gerrit.wikimedia.org/r/224561 (owner: Ori.livneh)
[04:40:59] operations, Traffic, HTTPS, Patch-For-Review: Drop AES-256 mid/compat lists.
- https://phabricator.wikimedia.org/T105716#1451258 (Chmarkine) My thought is that we'd better support a cipher suite as long as someone is actively using it and it is not close to broken (such as RC4). So how about keeping...
[04:43:45] operations, Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1451261 (ori) @Dzahn -- Grafana v2 (which we aren't running yet) has a backend component that is written in Go and which has not been packaged for Debian. Apart from its own backend, Grafana does n...
[04:47:06] (PS1) Ori.livneh: Follow-up for Ieb62ee050e: allow LCStoreStaticArray in server mode [mediawiki-config] - https://gerrit.wikimedia.org/r/224562
[04:47:20] (CR) Ori.livneh: [C: 2] Follow-up for Ieb62ee050e: allow LCStoreStaticArray in server mode [mediawiki-config] - https://gerrit.wikimedia.org/r/224562 (owner: Ori.livneh)
[04:47:25] (Merged) jenkins-bot: Follow-up for Ieb62ee050e: allow LCStoreStaticArray in server mode [mediawiki-config] - https://gerrit.wikimedia.org/r/224562 (owner: Ori.livneh)
[04:48:25] !log ori Synchronized wmf-config/CommonSettings.php: Follow-up for Ieb62ee050e: allow LCStoreStaticArray in server mode (duration: 00m 13s)
[04:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:54:23] operations, Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1451264 (Dzahn) @ori thank you for the detailed reply! So it sounds a VM makes sense indeed in this case. It would involve requesting a new VM (similar to T105507) and then it's pretty much like...
[04:58:16] !log Enabling LCStoreStaticArray in production. May be reverted by running: 'salt -G deployment_target:scap/scap cmd.run "rm /etc/lcstore"' on palladium.
[04:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:03:50] ..waiting for downforeveryoneorjustme.com ..
ironic
[05:03:57] PROBLEM - Disk space on mw1010 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:03:57] PROBLEM - Disk space on mw1005 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:08] PROBLEM - Disk space on mw1015 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:08] PROBLEM - Disk space on mw1011 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:08] PROBLEM - Disk space on mw1012 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:14] ok, that's not good
[05:04:16] PROBLEM - Disk space on mw1002 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:16] PROBLEM - Disk space on mw1016 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:17] PROBLEM - Disk space on mw1009 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:27] PROBLEM - Disk space on mw1006 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:27] PROBLEM - Disk space on mw1014 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:33] looking
[05:04:46] PROBLEM - Disk space on mw1007 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:04:58] PROBLEM - Disk space on mw1001 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:05:07] PROBLEM - Disk space on mw1008 is CRITICAL: DISK CRITICAL - free space: /run 3 MB (0% inode=99%)
[05:05:07] PROBLEM - Disk space on mw1004 is CRITICAL: DISK CRITICAL - free space: /run 3 MB (0% inode=99%)
[05:05:14] wait, this is all just /run
[05:05:16] not /
[05:05:17] PROBLEM - Disk space on mw1003 is CRITICAL: DISK CRITICAL - free space: /run 0 MB (0% inode=99%)
[05:05:18] but still
[05:05:19] that's /run
[05:05:21] yeah
[05:05:33] and these are job runners
[05:06:47] they keep their bytecode repo there, which they shouldn't
[05:07:07] i'll have a fix in a minute.
this isn't user-facing, fwiw
[05:07:18] cool, thanks ori
[05:12:17] so there is /run/lock and /run/shm , which one is it
[05:12:35] since it just says /run
[05:12:45] and the size is very much different
[05:13:36] and /run/user
[05:14:45] oh, and the tmpfs, nevermind
[05:15:58] (CR) Dzahn: [C: 2] remove wikkii table entirely [debs/wikistats] - https://gerrit.wikimedia.org/r/222237 (https://phabricator.wikimedia.org/T104367) (owner: Dzahn)
[05:16:47] (PS1) Ori.livneh: hhvm: use /var/cache/hhvm for hhbc files, per I83501931 [puppet] - https://gerrit.wikimedia.org/r/224563
[05:16:52] (CR) Dzahn: [C: 2] remove gentoo table entirely [debs/wikistats] - https://gerrit.wikimedia.org/r/222238 (https://phabricator.wikimedia.org/T104367) (owner: Dzahn)
[05:18:12] (CR) Dzahn: [C: 1] "yes, thanks! per icinga messages we got like:" [puppet] - https://gerrit.wikimedia.org/r/224563 (owner: Ori.livneh)
[05:18:36] (PS2) Ori.livneh: hhvm: use /var/cache/hhvm for hhbc files, per I83501931 [puppet] - https://gerrit.wikimedia.org/r/224563
[05:18:41] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1451278 (Nemo_bis) > I tend to think we should least have a software release available The necessary steps also must be do...
[05:18:46] (CR) Ori.livneh: [C: 2 V: 2] hhvm: use /var/cache/hhvm for hhbc files, per I83501931 [puppet] - https://gerrit.wikimedia.org/r/224563 (owner: Ori.livneh)
[05:26:26] RECOVERY - Disk space on mw1010 is OK: DISK OK
[05:26:26] RECOVERY - Disk space on mw1005 is OK: DISK OK
[05:26:37] RECOVERY - Disk space on mw1015 is OK: DISK OK
[05:26:37] RECOVERY - Disk space on mw1011 is OK: DISK OK
[05:26:37] RECOVERY - Disk space on mw1012 is OK: DISK OK
[05:26:38] RECOVERY - Disk space on mw1016 is OK: DISK OK
[05:26:38] RECOVERY - Disk space on mw1002 is OK: DISK OK
[05:26:47] RECOVERY - Disk space on mw1009 is OK: DISK OK
[05:26:47] RECOVERY - Disk space on mw1006 is OK: DISK OK
[05:26:47] RECOVERY - Disk space on mw1014 is OK: DISK OK
[05:26:58] !log Cleaned up now-unused hhbc files from /run/hhvm/cache on job runners
[05:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:27:08] RECOVERY - Disk space on mw1007 is OK: DISK OK
[05:27:27] RECOVERY - Disk space on mw1001 is OK: DISK OK
[05:27:28] RECOVERY - Disk space on mw1008 is OK: DISK OK
[05:27:36] RECOVERY - Disk space on mw1004 is OK: DISK OK
[05:27:46] RECOVERY - Disk space on mw1003 is OK: DISK OK
[05:29:01] (CR) Dzahn: "if the DBAs confirm these can go from coredb, they can also go from DHCP. why does db1001 stay btw" [puppet] - https://gerrit.wikimedia.org/r/224558 (https://phabricator.wikimedia.org/T105768) (owner: John F. Lewis)
[05:29:04] :)
[05:34:45] ori, grafana default dashboard :) https://gerrit.wikimedia.org/r/#/c/224129/ and good night
[05:35:51] mutante: thanks!
good night
[05:40:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[05:57:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds
[05:59:41] _joe|afk: that's puppet restarting hhvms after https://gerrit.wikimedia.org/r/#/c/224563/ , subsiding already.
[06:08:15] <_joe|afk> or I imagined that
[06:08:52] _joe|afk: imagined what?
[06:09:25] <_joe|afk> ori: nod, I imagined that
[06:09:47] cool
[06:10:09] _joe_: paul biss confirmed that there is no point in having the hhbc file in tmpfs, as you suspected
[06:10:11] <_joe_> that the puppet restarts were the reason of the 5xxs
[06:10:14] the OS does a good job keeping it in memory
[06:10:30] <_joe_> yes, it was an overoptimization on our part :)
[06:16:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[06:31:26] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 3 failures
[06:31:46] PROBLEM - puppet last run on mw1008 is CRITICAL Puppet has 1 failures
[06:32:18] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 2 failures
[06:32:27] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures
[06:32:50] I'm having image problem on svwp... it will appear in https://sv.wikipedia.org/w/index.php?title=Hylaeus_connectens&oldid=30216013 but not in https://sv.wikipedia.org/w/index.php?title=Hylaeus_connectens&oldid=30216014
[06:33:17] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:33:17] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:33:25] Deskana|Away, _joe_ etc. CRITICAL ^
[06:34:47] Josve05afk: the alerts are unrelated (and nothing to worry about)
[06:35:28] oh, ok. Got worried.
Last time I was here and it said critical or something, all images disappeared from all wikis...flashbacks
[06:35:58] (PS1) Ori.livneh: Use LCStoreStaticArray unconditionally [mediawiki-config] - https://gerrit.wikimedia.org/r/224573
[06:36:26] (CR) Ori.livneh: [C: 2] Use LCStoreStaticArray unconditionally [mediawiki-config] - https://gerrit.wikimedia.org/r/224573 (owner: Ori.livneh)
[06:36:32] (Merged) jenkins-bot: Use LCStoreStaticArray unconditionally [mediawiki-config] - https://gerrit.wikimedia.org/r/224573 (owner: Ori.livneh)
[06:37:28] morning
[06:37:52] I'll continue the elasticsearch rolling upgrade
[06:39:47] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:27] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[06:41:39] !log ori Synchronized wmf-config/CommonSettings.php: I9c9bf0f4: Use LCStoreStaticArray unconditionally (duration: 03m 02s)
[06:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:42:46] PROBLEM - RAID on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:44:28] RECOVERY - RAID on tin is OK optimal, 1 logical, 2 physical
[06:47:17] PROBLEM - puppet last run on tin is CRITICAL Puppet has 1 failures
[06:48:17] !log es1.6 step 6: upgrade elastic1005
[06:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:53:27] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1451316 (Prtksxna) >>! In T89640#1450499, @akosiaris wrote: > Just force pushing to the public repo instead of the private should be suf...
[06:55:47] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:55:47] RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:06] RECOVERY - puppet last run on mw1008 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:38] RECOVERY - puppet last run on mc1017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:19] <_joe_> dcausse: I'm sorry, I wasn't aware of your plans to perform the upgrade this week
[06:57:37] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:38] <_joe_> people in the ops team raised concerns about such a big upgrade happening during wikimania
[06:57:51] _joe_: oh ok
[06:58:05] <_joe_> dcausse: no don't stop because of it, just FYI for the future
[06:58:12] <_joe_> it's better if we get notified :)
[06:58:34] _joe_: sure, we'll do it next time, sorry
[06:58:37] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:13] operations, Traffic, Graphite, Varnish: Varnish caches Grafana dashboard configuration too strongly - https://phabricator.wikimedia.org/T105734#1451317 (Joe) p:Triage>Normal
[07:06:34] operations, Discovery: Cirrus search in codfw - https://phabricator.wikimedia.org/T105703#1451331 (Joe) p:Triage>High
[07:07:58] operations, Discovery, codfw-rollout: Cirrus search in codfw - https://phabricator.wikimedia.org/T105703#1449703 (Joe)
[07:08:06] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[07:09:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 14 07:09:10 UTC 2015 (duration 9m 9s)
[07:09:15]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:25:46] operations, CirrusSearch, Discovery, codfw-rollout: Implement multi-DC support in CirrusSearch - https://phabricator.wikimedia.org/T105709#1451336 (Joe)
[07:42:19] (PS2) Muehlenhoff: Optionally disable connection tracking per service [puppet] - https://gerrit.wikimedia.org/r/223751
[07:43:02] (CR) Muehlenhoff: [C: 2 V: 2] Optionally disable connection tracking per service [puppet] - https://gerrit.wikimedia.org/r/223751 (owner: Muehlenhoff)
[07:49:37] !log es1.6 step 7: upgrade elastic1006
[07:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:16:31] (PS1) Muehlenhoff: Enable packet filter for heze [puppet] - https://gerrit.wikimedia.org/r/224576
[08:17:31] (PS5) Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821)
[08:17:59] <_joe_> mobrovac: I think it starts to be reviewable :)
[08:18:16] hehe
[08:18:20] kk, will take a look
[08:18:59] <_joe_> still not ready though, btu the python script is
[08:19:14] <_joe_> oh I type like a king this morning
[08:20:34] lol
[08:20:46] it was like that for me yesterday
[08:24:50] (PS6) Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821)
[08:25:17] <_joe_> mobrovac: of course this is a noop, I'll still need to add the check to restbase (which doesn't use service::node)
[08:25:21] <_joe_> I'll do it now :)
[08:26:28] _joe_: i'll be working on adding the monitoring spec / endpoint to mathoid citoid and graphoid today/tomorrow
[08:26:42] <_joe_> mobrovac: oh nice :))
[08:26:57] albeit, can't do that for graphoid until yurik fixes the tests
[08:27:14] which we have been waiting to happen for a month now ...
[08:27:26] (PS1) Muehlenhoff: Disable connection tracking for pool counters [puppet] - https://gerrit.wikimedia.org/r/224577
[08:48:42] <_joe_> mobrovac: we could just turn off graphoid as a solution :P
[08:49:51] _joe_: it's not that bad, graphoid is (supposedly) working, but the tests are not passing, making it quite hard for us to do any operational changes to it as we can't really test
[08:51:31] <_joe_> yeah I was just playing the BOFH card
[08:53:08] (CR) Matanya: [C: 1] add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - https://gerrit.wikimedia.org/r/224519 (owner: Jgreen)
[08:55:12] (PS1) Giuseppe Lavagetto: restbase: spec-based monitoring [puppet] - https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831)
[08:55:22] operations, Services, Patch-For-Review, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1451363 (mobrovac)
[08:55:38] _joe_: hehe
[08:55:52] (CR) jenkins-bot: [V: -1] restbase: spec-based monitoring [puppet] - https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831) (owner: Giuseppe Lavagetto)
[08:56:58] (PS2) Giuseppe Lavagetto: restbase: spec-based monitoring [puppet] - https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831)
[08:59:16] (CR) Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints (3 comments) [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: Giuseppe Lavagetto)
[08:59:45] (CR) Giuseppe Lavagetto: "the checker script has been tested against the public restbase URL."
[puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: Giuseppe Lavagetto)
[08:59:58] (Abandoned) Giuseppe Lavagetto: poolcounter: just comment out helium, leave potassium [mediawiki-config] - https://gerrit.wikimedia.org/r/223840 (owner: Giuseppe Lavagetto)
[09:01:55] (PS2) Giuseppe Lavagetto: mediawiki: make www-data the default user [puppet] - https://gerrit.wikimedia.org/r/217265
[09:08:10] operations, Services, Patch-For-Review, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1173326 (mobrovac)
[09:16:52] (PS1) Muehlenhoff: Add ferm rules for statistics-web [puppet] - https://gerrit.wikimedia.org/r/224587
[09:20:34] operations: Ferm rules for abacist - https://phabricator.wikimedia.org/T104992#1451451 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[09:21:20] (PS1) Muehlenhoff: Enable packet filter for stat1001 [puppet] - https://gerrit.wikimedia.org/r/224588 (https://phabricator.wikimedia.org/T104992)
[09:22:13] operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1451456 (fgiunchedi) I think another puppet compiler was mentioned? anyways I tried running "utils/pcc" from puppet.git yesterday but at least the jenkins job doesn't seem to work? https://integration.wikimedia.org/ci/job...
[09:30:22] (CR) Giuseppe Lavagetto: [C: 2] "checked with the compiler, noop" [puppet] - https://gerrit.wikimedia.org/r/217265 (owner: Giuseppe Lavagetto)
[09:32:52] operations: Ferm rules for parsoid / wtp* hosts - https://phabricator.wikimedia.org/T104966#1451479 (MoritzMuehlenhoff) Open>Invalid a:MoritzMuehlenhoff The parsoid hosts already use ferm rules, they're defined inside the role defition, so they we initially overlooked.
[09:51:24] (PS1) Giuseppe Lavagetto: imagescalers: reimage mw1154, mw1155 to HAT [puppet] - https://gerrit.wikimedia.org/r/224594 (https://phabricator.wikimedia.org/T84842)
[09:51:40] (CR) Filippo Giunchedi: [C: -1] "a few comments/thoughts but LGTM overall" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/224548 (owner: Manybubbles)
[09:54:28] PROBLEM - puppet last run on cp2019 is CRITICAL puppet fail
[09:58:44] operations, RESTBase, Monitoring, Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1451493 (fgiunchedi) >>! In T78514#1450817, @GWicke wrote: > We discussed this on IRC, but didn't mention it here yet: The...
[10:01:43] (CR) Southparkfan: imagescalers: reimage mw1154, mw1155 to HAT (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224594 (https://phabricator.wikimedia.org/T84842) (owner: Giuseppe Lavagetto)
[10:03:01] (CR) Giuseppe Lavagetto: "thanks!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/224594 (https://phabricator.wikimedia.org/T84842) (owner: Giuseppe Lavagetto)
[10:03:39] (PS2) Giuseppe Lavagetto: imagescalers: reimage mw1154, mw1155 to HAT [puppet] - https://gerrit.wikimedia.org/r/224594 (https://phabricator.wikimedia.org/T84842)
[10:04:32] (CR) Giuseppe Lavagetto: [C: 2] imagescalers: reimage mw1154, mw1155 to HAT [puppet] - https://gerrit.wikimedia.org/r/224594 (https://phabricator.wikimedia.org/T84842) (owner: Giuseppe Lavagetto)
[10:06:34] <_joe_> !log reimaging mw1154
[10:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:12:01] <_joe_> !log stopped poolcounter on mw1154
[10:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:15:18] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:21:06] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:21:18] RECOVERY - puppet last run on cp2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:22:26] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[10:23:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[10:24:01] <_joe_> that's me sorry
[10:25:36] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:26:07] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[10:28:44] 6operations, 6Services, 7Service-Architecture: Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780#1451502 (10mobrovac) 3NEW [10:28:58] 6operations, 6Services, 7Service-Architecture: Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780#1451510 (10mobrovac) [10:29:01] 6operations, 6Services, 5Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1451509 (10mobrovac) [10:59:54] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 20.69% of data above the critical threshold [100000000.0] [11:05:03] PROBLEM - HHVM processes on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:06:04] PROBLEM - RAID on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:06:23] PROBLEM - configured eth on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:06:34] PROBLEM - dhclient process on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:07:03] PROBLEM - nutcracker port on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:07:14] PROBLEM - nutcracker process on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:07:33] PROBLEM - puppet last run on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:07:33] PROBLEM - DPKG on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:07:42] <_joe_> grr damn downtime [11:07:43] PROBLEM - Disk space on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [11:07:43] PROBLEM - salt-minion processes on mw1154 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. 
[11:18:54] RECOVERY - DPKG on mw1154 is OK: All packages OK [11:19:04] RECOVERY - Disk space on mw1154 is OK: DISK OK [11:19:04] RECOVERY - salt-minion processes on mw1154 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:19:23] RECOVERY - RAID on mw1154 is OK no RAID installed [11:19:43] RECOVERY - configured eth on mw1154 is OK - interfaces up [11:19:54] RECOVERY - dhclient process on mw1154 is OK: PROCS OK: 0 processes with command name dhclient [11:20:14] RECOVERY - HHVM processes on mw1154 is OK: PROCS OK: 6 processes with command name hhvm [11:20:23] RECOVERY - nutcracker port on mw1154 is OK: TCP OK - 0.000 second response time on port 11212 [11:20:34] RECOVERY - nutcracker process on mw1154 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [11:20:45] PROBLEM - puppet last run on mw1154 is CRITICAL Puppet has 6 failures [11:22:44] RECOVERY - puppet last run on mw1154 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:25:39] <_joe_> !log repooling mw1154 with HHVM [11:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:14] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [11:27:45] I could use a review for this https://gerrit.wikimedia.org/r/#/c/222205/ [11:28:08] <_joe_> godog: ack [11:28:19] !log es1.6 step 8: upgrade elastic1007 [11:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:11] <_joe_> godog: do we have a list of valid characters in the icinga config? [11:32:23] <_joe_> sorry, bbl [11:33:04] _joe|afk: yep, it is illegal_object_name_chars as explained in the comment [11:39:10] quick question about PDF files with 1 landscape and 1 portrait page. Are they known to have display issues, with the landscape page being treated as a portrait page when the thumbnail is created ? 
(https://en.wikipedia.org/wiki/File:Woodlands_Cemetery,_Philadelphia,_Gwyn_cemetery_record_for_Section_E_0033_N.5.pdf) [12:16:12] (03CR) 10Ori.livneh: "Is it better to barf at invalid characters or quietly normalize or strip them? (I'm inclined toward the latter option.)" [puppet] - 10https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) (owner: 10Filippo Giunchedi) [12:16:33] Reviewed the patch at https://gerrit.wikimedia.org/r/#/c/222205/, Master. [12:17:06] !log Logging a message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log. [12:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:17:18] haha ori did you made it to your flight? [12:17:29] not sure yet, on BART [12:19:13] godog: `git grep safe.*regsubst` for some possibilities [12:22:05] ori: thanks, I'll take a look, I'm trying to argument why I'm in the former camp [12:24:07] i usually am too [12:24:19] it's python vs php [12:24:27] In the face of ambiguity, refuse the temptation to guess. [12:24:47] the problem is that the consequences for puppet failures are disproportionate to the crime [12:25:15] icinga-wm does its best to pretend that the world is ending [12:26:22] yeah that's true in this case [12:27:11] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Drop AES-256 mid/compat lists. - https://phabricator.wikimedia.org/T105716#1451590 (10BBlack) re: Camellia: we've only got stats for 1 week so far, but yeah, I'm not fond of keeping either of the Camellia options in the long run. They don't appear to be i... 
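The reject-versus-normalize question raised above for illegal characters in Icinga service descriptions can be sketched roughly as follows. This is only an illustration, not the code from the patch under review: ILLEGAL_CHARS is an assumed stand-in for Icinga's illegal_object_name_chars setting, and normalize and find_collisions are hypothetical helper names. The second helper reports descriptions that become identical after normalization, which is the duplicate-definition risk that quiet normalization carries.

```python
import re
from collections import defaultdict

# Assumption: a stand-in for Icinga's illegal_object_name_chars setting;
# the value actually configured on a given install may differ.
ILLEGAL_CHARS = "`~!$%^&*|'\"<>?,()="


def normalize(description, replacement="_"):
    """Replace each illegal character instead of failing the run."""
    pattern = "[" + re.escape(ILLEGAL_CHARS) + "]"
    return re.sub(pattern, replacement, description)


def find_collisions(descriptions):
    """Return normalized names that more than one original maps to."""
    groups = defaultdict(set)
    for original in descriptions:
        groups[normalize(original)].add(original)
    return {
        normalized: sorted(originals)
        for normalized, originals in groups.items()
        if len(originals) > 1
    }
```

Running find_collisions over all descriptions before applying normalize would be one way to keep the quiet behaviour while still failing loudly in the one case where normalization makes two checks indistinguishable.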
[12:27:22] reasons to oppose normalization are: * if it will cause confusing and hard-to-debug duplicate definition errors (by normalizing `foo#` and `foo%' to 'foo') [12:27:42] if it will mangle the description beyond recognition [12:27:50] neither seems applicable in this case [12:28:12] descriptions are likely to be unique and to contain mostly words and numbers [12:28:25] if some illicit punctuation crept in it was probably by accident, hard to imagine that it's essential to the meaning [12:34:25] (03PS4) 10Jgreen: add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/224519 [12:36:45] Jeff_Green: good catch ^ :) [12:37:09] ya, I suddenly remembered that I had forgotten to change those last time [12:37:59] matanya: if it's +1 worthy now can you do so, and I'll deploy it? [12:38:16] (03CR) 10Matanya: [C: 031] add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/224519 (owner: 10Jgreen) [12:38:23] thanks! [12:38:27] :) [12:38:45] here's hoping for no surprises... [12:39:08] yes :) [12:39:20] bblack: what is the motivation to remove ciphers ? [12:39:21] (03PS2) 10BBlack: HTTPS redirects: Remove meta+MediaWiki exception [puppet] - 10https://gerrit.wikimedia.org/r/224556 [12:39:57] (03PS5) 10Jgreen: add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/224519 [12:40:47] matanya: because a bunch of the ciphers in the list are never defaults (they're very rarely chosen by clients, and only because a user fucked with the client's config, essentially), and on top of that the user made a poor choice when they messed with it. [12:41:00] we'd rather force them upwards in our list and have them make more-secure choices, basically. 
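The idea of forcing clients upward through a server-ordered cipher list can be sketched as below. This is a simplified model and not the contents of ssl_ciphersuite.rb: SERVER_PREFERENCE is an illustrative list using common OpenSSL cipher names, the tier labels follow the strong/mid/compat grouping mentioned in the discussion, and negotiate is a hypothetical helper that mimics server-preference selection.

```python
# Illustrative server preference order, best first. Assumption: this is
# NOT the real production list, just a small example in the same spirit.
SERVER_PREFERENCE = [
    ("ECDHE-RSA-AES128-GCM-SHA256", "strong"),  # forward secrecy + AEAD
    ("ECDHE-RSA-AES256-GCM-SHA384", "strong"),  # forward secrecy + AEAD
    ("ECDHE-RSA-AES128-SHA", "mid"),            # forward secrecy, no AEAD
    ("ECDHE-RSA-AES256-SHA", "mid"),            # forward secrecy, no AEAD
    ("AES128-SHA", "compat"),                   # neither
    ("AES256-SHA", "compat"),                   # neither
]


def negotiate(client_ciphers):
    """Pick the first server-preferred cipher the client also offers.

    Returns a (name, tier) tuple, or None if there is no overlap.
    """
    offered = set(client_ciphers)
    for name, tier in SERVER_PREFERENCE:
        if name in offered:
            return name, tier
    return None
```

In this model a client that offers only AES256-SHA, for example because AES128 was disabled by hand, ends up in the compat tier with neither forward secrecy nor AEAD, while a client that leaves its defaults alone is pushed to the strong tier regardless of the order in which it lists its ciphers.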
[12:42:05] (03CR) 10Jgreen: [C: 032 V: 031] add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/224519 (owner: 10Jgreen) [12:42:09] bblack: thanks, but if it is a one-hard-coded cipher we are actually preventing them from access or downgrading them, if they have worse ciphers [12:43:03] matanya: in the case of AES256, that's never the case. if the client has AES256, it at least has equivalent AES128 options as well. [12:43:32] bblack: you can't have AES256 only ? [12:43:55] the implementation never does. the user could choose to disable AES128, which is what's happening here in some cases I think. [12:44:31] in any case, we're not talking about disabling all AES256 options, we're talking about disabling the less-secure ones in favor of more-secure ones. [12:44:57] there's a lot of dimensions to ciphersuite choice. PFS and AEAD are far more important than any view on AES256 vs AES128. [12:44:59] though some weird, edge case might have AES128 disabled altogether [12:45:18] matanya: our stats say no. it's single-digit handfuls of users. [12:45:18] yea, agree, just wondered what was the motivation. [12:45:37] it's covered in the ticket :) [12:45:57] i saw that, but this conversation clarified it to me. thank you [12:47:23] hrm, that didn't go well. ferm doesn't like the puppet output on neon [12:47:24] aside from just simplifying everything on our end, it concerns me that some may have, in an attempt to choose AES256 over 128, also accidentally disabled PFS/AEAD. This would be those 5 users or whatever's wakeup call to fix their client. 
It's kind of insane to say "I care about security enough to mess with the cipher settings and try to pick something that at least appears, on the surface, [12:47:30] to be more secure, but then I disabled a bunch of options that are definitely way more secure" [12:48:30] <_joe_> !log reimaging mw1155 [12:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:49:30] anyways, like I said in the last update on the ticket, disabling the worst of the AES256 choices is not urgent. It's just an annoyance to address at some point. [12:49:49] thanks again [12:49:53] we need to do some publicity on other related issues anyways, we could hold for that and include some messaging about it in a blog post [12:50:07] good idea [12:50:32] the main thrust of the above being, at some point we need to make a public blog post sort of statement about "Hey, if you're still using Windows XP, that's a really horrible thing that's killing your security with all websites, not just ours, please upgrade!" [12:51:37] we're still getting around 50M requests/day from IE8/XP :/ [12:52:02] (out of 8.3B, but still!) [12:52:14] (03PS1) 10Jgreen: fix typo from I67a16556 [puppet] - 10https://gerrit.wikimedia.org/r/224606 [12:53:20] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection refused [12:53:51] also, 23M reqs/day using AES128-SHA, which is almost as bad. From what I've sampled of those, the vast majority are due to corporate SSL-breaking firewalls. [12:54:00] (03PS2) 10Jgreen: fix typos from I67a16556 [puppet] - 10https://gerrit.wikimedia.org/r/224606 [12:54:02] that issue needs more publicity in general, it's not just about us. [12:55:07] The scenario there is that the user may have a modern browser with good crypto on their work desktop machine, and maybe they visit their bank from work over lunch or whatever, and their bank tries to prefer good ciphers too. 
But the SSL-breaking outbound firewall/proxy their office runs kills the connection's security in the process. [12:55:19] PROBLEM - salt-minion processes on mw1155 is CRITICAL: Connection refused by host [12:55:30] PROBLEM - nutcracker port on mw1155 is CRITICAL: Connection refused by host [12:55:40] PROBLEM - DPKG on mw1155 is CRITICAL: Connection refused by host [12:55:49] (aside from the fact that you shouldn't ever trust an office computer in a situation like that. they're probably logging critical bits of traffic that can be used to help access your bank account in some corporate IT server somewhere. It's a way easier target than the bank itself...) [12:55:49] PROBLEM - Disk space on mw1155 is CRITICAL: Connection refused by host [12:55:51] PROBLEM - RAID on mw1155 is CRITICAL: Connection refused by host [12:55:51] PROBLEM - dhclient process on mw1155 is CRITICAL: Connection refused by host [12:56:22] (03CR) 10Jgreen: [C: 032 V: 031] fix typos from I67a16556 [puppet] - 10https://gerrit.wikimedia.org/r/224606 (owner: 10Jgreen) [12:56:32] bblack: can you list "inscure" or less secure options people pick [12:56:46] and we can try to figure out what they are composed from ? [12:56:49] frankly I would have expected the IE8/XP share to be higher... [12:57:00] gladly it is not [12:57:00] (03CR) 10Mobrovac: "Maybe I don't have the correct versions or sth, but I need to comment out two lines (cf. the in-lined comments) in order for the script to" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [12:58:28] matanya: all of the cipher options we support are listed here grouped into 3 categories: https://github.com/wikimedia/operations-puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/ssl_ciphersuite.rb#L78 [12:58:38] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1451642 (10mark) >>! 
In T104458#1446582, @BBlack wrote: > They're replacing the old ones. I guess technically we could name them anything, but it will be less-confusing months from now if they'... [12:59:10] the strong category is basically "best available", but only about 2/3rds of clients are even capable of those options. Notably, absolutely zero versions of any Apple product have those options :/ [12:59:13] Yes, i know that, i mean the correlation OS+browser => cipher-suite [12:59:29] there's some info about that in the source comments, in broad terms [12:59:35] <_joe_> non ti /win 37 [13:00:00] bblack: I would build a table with support map [13:00:06] the "mid" options are non-AEAD, which means they're probably vulnerable to uncommon attacks nobody's written easy exploit code for yet, but you can bet state actors can figure it out [13:00:12] actually, i "will" do it. [13:00:26] matanya: it's virtually impossible, the table would be like 200 pages long if you really get into every client implementation on the planet, sadly. [13:00:38] major stuff at least [13:00:43] but, we know the broad strokes of it for the major client platforms. [13:00:43] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1451644 (10mark) So what exact interfaces do we have in the new machines? Usually 10G interfaces are not 1G capable, but if they're separate interfaces (as I think is the case here) we should be... [13:01:01] the links in here will tell you most of it: https://www.ssllabs.com/ssltest/clients.html [13:01:20] but even that doesn't list all the possibilities and exceptions you have to care about [13:01:38] i know about this too :) i wanted something more wikimedia-specific [13:01:54] i.e. list of UA we get mapped to cipher suites [13:02:02] ah [13:02:26] that's non-trivial either. 
I've just been investigating the insecure / low-volume cases on case-by-case basis by watching traffic logs [13:02:39] there's exceptions to exceptions to exceptions always. [13:03:09] for instance: Chrome on any platform is usually one of the more-secure client choices, since Google cares about this stuff and it auto-updates.... [13:03:55] <_joe_> while well, it listens to your room and sends data back to google of course :P [13:03:58] but I kept seeing latest versions of Chrome-on-(Win7, and others) in traffic logs picking horrible-insecure choices like AES128-SHA, and thought it might be a bug. [13:04:16] _joe_: sure s/Chrome/Chromium/ is the same on this stuff [13:04:49] probably some proxy/weird network equip in the way ? [13:04:52] anyways, the Chrome insecure choices problem eventually I figured out through analyzing IP addresses was due to the corporate firewalls mentioned above. [13:05:19] i'd mail them :D [13:05:39] they're on corporate machines with fake root certs installed, so that the corporate outbound proxy can silently hijack all SSL connections and decrypt-log-re-encrypt, and do a poor job of the latter. [13:06:37] there's no point emailing them, it's a long tail and a huge waste of time. [13:06:39] <_joe_> bblack: yeah I know something about that [13:06:59] lots of businesses, it would take forever to find them all and frankly it's not our job to tell them all they suck one by one :) [13:07:03] <_joe_> (bad proxies screwing up HTTPS) [13:07:06] bblack: "Dear , If you're going to to sniff your employee's traffic please try not to be an idiot while doing so. xoxo, bblack" [13:07:10] bblack: Wrote yer email ^ [13:07:25] <_joe_> ostriches: we could write to blue coat inc. 
directly [13:07:26] honestly, the poor outbound crypto is the least of the issues there [13:07:38] <_joe_> and I agree with bblack [13:07:56] what's worse is that the employees don't realize they're also trusting their probably-inept local IT Security policies with all of their supposedly-private traffic (logging and access) [13:08:08] Oh I don't disagree. It's just 6am and my snark is the first emotion to wake up :p [13:08:38] even if it were re-encrypted properly, it's a huge security loss that your banking transactions are now also subject to the weakest-link being some server in the closet in your office whoever runs them. [13:09:12] <_joe_> bblack: a blue coat appliance, lemme find the typical model [13:09:35] also, what's really really awesome, is some of these companies have now outsourced this to cloud-based services. I found one over the weekend. [13:09:55] <_joe_> https://www.bluecoat.com/products/proxysg-secure-web-gateway this [13:10:08] so the corporation actually sends their users' traffic off to a shared cloud server elsewhere on the internet to get decrypted and infiltrated and then re-encrypted poorly and sent on its way. [13:10:39] that's this: https://www.zscaler.com/ [13:10:47] (we had a bunch of hits through them on Friday when I was looking) [13:11:00] bblack: Coincidentally, that's also what we do with our traffic in the US. Only s/some company/NSA/ :p [13:11:01] <_joe_> my god [13:11:18] <_joe_> we should publish a blog post on those findings, bblack [13:11:20] "Working to make the internet safe for business", by routing your traffic out into a cloud service and then downgrading the security on the outbound side to the worst possible choice. [13:11:30] <_joe_> lol [13:12:01] bblack: This one made me laugh more, I think.... 
"Internet security for every byte, every port and every protocol – delivered 100% in the cloud" [13:12:17] "Our security services scan and filter every byte of your network traffic, including SSL-encrypted sessions, as it passes to and from the internet." [13:13:04] “I pitched them to my Board as the Swiss Army knife approach.” --CISO, Fortune 500 Company [13:14:02] _joe_: I'm still overdue for the basic tech blog post on how we're configuring our server-side and such, I'm hoping to finish that this week before I take off. [13:14:27] but yes, I really want to do a followup after that about these sorts of findings, client security issues, begging users to upgrade and what to be more-conscious of, etc. [13:17:49] (03PS10) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [13:18:02] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1451669 (10Cmjohnson) The servers were ordered with 4 10GB adapters. 2 ports sit on the system board and 2 are on a pci-e card. I am nearly positive these do not have any 1GB ports. [13:20:44] bblack: any draft for the curious ? [13:21:00] nothing worth reading yet. soon! [13:22:26] (03CR) 10Manybubbles: Add es-tool upgrade-fast and stopping paranoia (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224548 (owner: 10Manybubbles) [13:25:01] PROBLEM - nutcracker port on mw1155 is CRITICAL: Connection refused by host [13:25:20] PROBLEM - nutcracker process on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:25:40] PROBLEM - DPKG on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:25:40] PROBLEM - puppet last run on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:25:51] PROBLEM - salt-minion processes on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. 
[13:25:51] PROBLEM - Disk space on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:26:10] PROBLEM - HHVM processes on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:26:40] PROBLEM - check_puppetrun on bellatrix is CRITICAL Puppet has 3 failures [13:26:51] PROBLEM - RAID on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:27:21] PROBLEM - configured eth on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:27:40] PROBLEM - dhclient process on mw1155 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:28:50] not sure, but this ^^^ could be related to the fact that I just deployed a new nsca config file, so icinga got restarted [13:28:50] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 301 not found on https://10.195.0.37:443https://fundraising.wikimedia.org/IPNListener_Standalone.php - 214 bytes in 0.010 second response time [13:28:50] PROBLEM - check_load on mintaka is CRITICAL - load average: 16.71, 20.68, 21.93 [13:28:50] PROBLEM - check_procs on mintaka is CRITICAL: PROCS CRITICAL: 1029 processes [13:28:51] PROBLEM - check_puppetrun on mintaka is CRITICAL Puppet has 2 failures [13:29:11] (03PS1) 10BBlack: fix ganglia dns view host regexes [puppet] - 10https://gerrit.wikimedia.org/r/224611 [13:29:30] PROBLEM - check_puppetrun on alnilam is CRITICAL Puppet has 2 failures [13:29:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:29:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.013 second response time [13:29:41] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 2 failures [13:30:05] (03PS3) 
10Manybubbles: Add es-tool upgrade-fast and stopping paranoia [puppet] - 10https://gerrit.wikimedia.org/r/224548 [13:30:10] PROBLEM - check_mysql on fdb2001 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [13:30:31] PROBLEM - check_mysql on payments2002 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [13:30:31] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.011 second response time [13:30:31] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 2 failures [13:30:31] PROBLEM - check_raid on payments2002 is CRITICAL HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [13:30:32] PROBLEM - check_listener_gc on saiph is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 214 bytes in 0.012 second response time [13:30:32] PROBLEM - check_listener_ipn on saiph is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.010 second response time [13:30:32] PROBLEM - check_puppetrun on saiph is CRITICAL Puppet has 2 failures [13:30:40] PROBLEM - check_mysql on payments2001 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:30:41] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.009 second response time [13:30:41] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 2 failures [13:31:40] PROBLEM - check_puppetrun on bellatrix is CRITICAL Puppet has 3 failures 
[13:33:50] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 301 not found on https://10.195.0.37:443https://fundraising.wikimedia.org/IPNListener_Standalone.php - 214 bytes in 0.009 second response time [13:33:51] PROBLEM - check_load on mintaka is CRITICAL - load average: 25.09, 23.68, 22.93 [13:33:51] PROBLEM - check_procs on mintaka is CRITICAL: PROCS CRITICAL: 1029 processes [13:33:51] PROBLEM - check_puppetrun on mintaka is CRITICAL Puppet has 2 failures [13:34:30] PROBLEM - check_puppetrun on alnilam is CRITICAL Puppet has 2 failures [13:34:38] (03CR) 10BBlack: [C: 032] fix ganglia dns view host regexes [puppet] - 10https://gerrit.wikimedia.org/r/224611 (owner: 10BBlack) [13:34:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:34:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time [13:34:41] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 2 failures [13:35:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:35:30] PROBLEM - check_mysql on payments2002 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [13:35:30] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time [13:35:31] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 2 failures [13:35:31] PROBLEM - check_raid on payments2002 is CRITICAL HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [13:35:31] 
PROBLEM - check_listener_gc on saiph is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 214 bytes in 0.010 second response time [13:35:31] PROBLEM - check_listener_ipn on saiph is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.008 second response time [13:35:32] PROBLEM - check_puppetrun on saiph is CRITICAL Puppet has 2 failures [13:36:40] PROBLEM - check_puppetrun on bellatrix is CRITICAL Puppet has 3 failures [13:37:00] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 69 seconds ago with 0 failures [13:39:30] PROBLEM - check_puppetrun on alnilam is CRITICAL Puppet has 2 failures [13:39:36] (03PS4) 10Manybubbles: Add es-tool upgrade-fast and stopping paranoia [puppet] - 10https://gerrit.wikimedia.org/r/224548 [13:39:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:39:41] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time [13:39:41] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 2 failures [13:39:51] RECOVERY - RAID on mw1155 is OK no RAID installed [13:40:11] RECOVERY - configured eth on mw1155 is OK - interfaces up [13:40:11] RECOVERY - HHVM processes on mw1155 is OK: PROCS OK: 6 processes with command name hhvm [13:40:20] RECOVERY - nutcracker port on mw1155 is OK: TCP OK - 0.000 second response time on port 11212 [13:40:31] RECOVERY - salt-minion processes on mw1155 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:40:40] RECOVERY - DPKG on mw1155 is OK: All packages OK [13:40:50] RECOVERY - dhclient process on mw1155 is OK: PROCS OK: 
0 processes with command name dhclient [13:41:40] RECOVERY - check_puppetrun on bellatrix is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:41:41] RECOVERY - nutcracker process on mw1155 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [13:41:42] RECOVERY - Disk space on mw1155 is OK: DISK OK [13:42:31] PROBLEM - puppet last run on mw1155 is CRITICAL Puppet has 6 failures [13:44:30] RECOVERY - check_puppetrun on alnilam is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:44:31] RECOVERY - puppet last run on mw1155 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:44:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:44:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.008 second response time [13:44:40] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 2 failures [13:47:30] (03CR) 10Filippo Giunchedi: "from irc, tl;dr "we can replace characters instead of barfing"" [puppet] - 10https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) (owner: 10Filippo Giunchedi) [13:49:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:49:41] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.008 second response time [13:49:41] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 2 failures [13:50:03] (03PS1) 10BBlack: ciphersuites: update ordering commentary [puppet] - 
10https://gerrit.wikimedia.org/r/224615 [13:50:30] RECOVERY - check_puppetrun on payments2002 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [13:50:40] RECOVERY - check_puppetrun on payments2001 is OK Puppet is currently enabled, last run 245 seconds ago with 0 failures [13:51:03] (03CR) 10BBlack: [C: 032] ciphersuites: update ordering commentary [puppet] - 10https://gerrit.wikimedia.org/r/224615 (owner: 10BBlack) [13:53:28] (03PS3) 10BBlack: HTTPS redirects: Remove meta+MediaWiki exception [puppet] - 10https://gerrit.wikimedia.org/r/224556 [13:54:18] (03CR) 10BBlack: [C: 032] HTTPS redirects: Remove meta+MediaWiki exception [puppet] - 10https://gerrit.wikimedia.org/r/224556 (owner: 10BBlack) [13:54:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:54:41] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.009 second response time [13:54:41] RECOVERY - check_puppetrun on payments2003 is OK Puppet is currently enabled, last run 280 seconds ago with 0 failures [13:55:35] (03PS2) 10Filippo Giunchedi: monitoring: replace illegal chars in description [puppet] - 10https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) [13:56:31] (03CR) 10Giuseppe Lavagetto: "I'll comment out those two lines, with a comment stating we can de-comment them once trusty is out of the way." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [13:59:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [13:59:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.011 second response time [14:01:48] !log es1.6 step 9: upgrade elastic1008 [14:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:58] (03PS3) 10Filippo Giunchedi: monitoring: replace illegal chars in description [puppet] - 10https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) [14:03:29] (03PS3) 10Giuseppe Lavagetto: restbase: spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831) [14:03:31] (03PS7) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [14:03:50] RECOVERY - check_procs on mintaka is OK: PROCS OK: 363 processes [14:04:20] (03CR) 10jenkins-bot: [V: 04-1] service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [14:04:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:04:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.009 second response time [14:05:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures 
[14:09:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:09:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.008 second response time [14:10:00] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:10:36] (03PS4) 10Giuseppe Lavagetto: restbase: spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831) [14:10:38] (03PS8) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [14:14:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:14:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.009 second response time [14:19:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:19:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.009 second response time [14:22:02] 6operations, 10Traffic, 7HTTPS: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1451751 (10BBlack) 3NEW [14:24:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:24:40] PROBLEM - check_payments_wiki on payments2003 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.008 second response time [14:26:04] (03CR) 10Filippo Giunchedi: "a few comments, LGTM overall but we ought to have tests for this" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [14:27:09] (03PS2) 10BBlack: HTTPS: redirect POST with 307 [puppet] - 10https://gerrit.wikimedia.org/r/221974 (https://phabricator.wikimedia.org/T105794) [14:28:50] RECOVERY - check_load on mintaka is OK - load average: 0.00, 0.08, 3.65 [14:29:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:29:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.013 second response time [14:34:12] !log started RESTBase revision thin-out script for html and data-parsoid on wikimedia domains [14:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:34:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.009 second response time [14:35:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:37:09] (03CR) 10Filippo Giunchedi: Add es-tool upgrade-fast and stopping paranoia (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224548 (owner: 10Manybubbles) [14:37:47] 7Blocked-on-Operations, 
6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1451791 (10GWicke) [14:38:07] (03CR) 10Filippo Giunchedi: [C: 031] Add es-tool upgrade-fast and stopping paranoia [puppet] - 10https://gerrit.wikimedia.org/r/224548 (owner: 10Manybubbles) [14:38:13] (03CR) 10Manybubbles: "I replaced "blow up" with a more extensive comment." [puppet] - 10https://gerrit.wikimedia.org/r/224548 (owner: 10Manybubbles) [14:38:36] (03PS5) 10Manybubbles: Add es-tool upgrade-fast and stopping paranoia [puppet] - 10https://gerrit.wikimedia.org/r/224548 [14:39:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:39:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.009 second response time [14:40:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:41:55] (03CR) 10Filippo Giunchedi: "Agreed with Andrew, if codfw/eqiad already contain other hosts let's move it there even with the same value for consistency" [puppet] - 10https://gerrit.wikimedia.org/r/201880 (owner: 10Dzahn) [14:44:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:44:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.012 second response time [14:45:01] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 204 seconds ago with 0 failures [14:49:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:49:40] PROBLEM - check_payments_wiki on payments2003 
is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time [14:52:01] 6operations, 10ops-codfw, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1451819 (10fgiunchedi) machine is replicating the objects to the failed disk [14:54:13] (03CR) 10Mobrovac: [C: 04-1] "First round of comments" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [14:54:32] FWIW I'm not around for SWAT this morning (forgot to remove myself on wikitech) I'm on a bus to the airport for wikimania. [14:54:40] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:54:40] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.011 second response time [14:57:08] 6operations, 10ops-eqiad, 10Analytics-Cluster: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1451836 (10Ottomata) Heya Chris, any updates? 
[14:58:51] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1451838 (10BBlack) [14:59:41] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [14:59:44] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [14:59:44] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.011 second response time [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150714T1500). [15:00:34] I'm here :) [15:00:41] who's SWAT'ng? [15:01:02] uhhh - I dunno. I think I might be one of the few around on the list [15:01:08] I'll do it [15:01:42] manybubbles: thanks. I think only one patch from me. [15:01:46] 6operations, 10RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1451846 (10mark) As we discussed in the Ops meeting yesterday, please revert all nodes back to stable/maintained JDK7 so we can get a good baseline while things are stable, and can do limited testing with OpenJDK 8 in... [15:01:54] yeah [15:02:02] I could probably do it all week and it not be a big deal [15:02:35] kart__: I've +2ed. when it merges I'll deploy [15:02:46] manybubbles: cool! 
[15:02:48] 6operations, 10RESTBase, 7Blocked-on-Services: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1451847 (10mark) [15:04:31] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [15:04:31] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time [15:05:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:05:00] Jeff_Green: should we just downtime the whole hosts for a while until they're up? [15:07:11] manybubbles: it seems merged. [15:07:42] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1451857 (10BBlack) This is the normal bnx2x 4x10GbE setup right, the BCM57800? From a software perspective that card just looks like 4x identical broadcom ports, but I've never seen what it ends... [15:08:49] scapeded [15:08:50] !log manybubbles Synchronized php-1.26wmf13/extensions/UniversalLanguageSelector/: SWAT add some hooks to extension.json (duration: 00m 13s) [15:08:54] kart__: ^^ [15:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:08] manybubbles: thanks! [15:09:21] all seems well. 
I'll leave the log open for a few minutes but looks fine [15:09:28] have a good day [15:09:31] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [15:09:31] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.011 second response time [15:10:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:13:14] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1451865 (10BBlack) Nevermind, clearly that's not the case. I just dug around in RT and found the order. Apparently these are HPs, with some completely different network card setup than the exi... [15:14:29] manybubbles: thanks! confirmed, it is fine. [15:14:30] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [15:14:31] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time [15:15:00] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 185 seconds ago with 0 failures [15:15:01] yup - no weird errors in logs too - so I'm considering this SWAT all finished [15:16:30] 6operations, 10SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#1451883 (10dr0ptp4kt) @WWes, @Stu, @ori: {wmoran,swest,olivneh} AT wikimedia //dot// org have Google Search Console restricted access to the following HTTPS websites. Note manually typed jp.wiki redirects to a ja subdomain, which is... 
[15:17:08] (03CR) 10Dzahn: "i'm not sure about the rpcbind/rpc.statd processes that are also listening" [puppet] - 10https://gerrit.wikimedia.org/r/224576 (owner: 10Muehlenhoff) [15:19:31] PROBLEM - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) [15:19:31] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time [15:21:31] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1451920 (10BBlack) Apparently the HPs have 2x onboard + 2x on PCIe as noted earlier. The PCIe card is HP's own using a QLogic chipset, which apparently in turn is a rebranded Broadcom chip simi... [15:22:01] ACKNOWLEDGEMENT - check_mysql on payments2003 is CRITICAL: Access denied for user nagios@localhost (using password: NO) daniel_zahn codfw - just enabled [15:22:01] ACKNOWLEDGEMENT - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 214 bytes in 0.010 second response time daniel_zahn codfw - just enabled [15:27:14] 6operations, 10ops-eqiad, 10Analytics-Cluster: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1451927 (10Cmjohnson) 3 of 4 are racked in row D and connected to mgmt. They only need re-install. The 4th one was broken and I didn't get to replace it before I left on vacation. Dell has... 
[15:28:48] (03CR) 10Dzahn: "looks like NFS, but then i don't see an NFS mount there, as opposed to say, ms1001, which legitimately has one" [puppet] - 10https://gerrit.wikimedia.org/r/224576 (owner: 10Muehlenhoff) [15:31:25] (03PS1) 10Filippo Giunchedi: mediawiki: fix jobqueue metric name [puppet] - 10https://gerrit.wikimedia.org/r/224627 [15:31:56] 6operations, 10Wikimedia-IRC, 7Ipv6, 5Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#1451938 (10Dzahn) [15:32:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] mediawiki: fix jobqueue metric name [puppet] - 10https://gerrit.wikimedia.org/r/224627 (owner: 10Filippo Giunchedi) [15:33:16] (03PS5) 10Giuseppe Lavagetto: restbase: spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831) [15:33:18] (03PS9) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [15:33:33] <_joe_> godog: ^^ this should've addressed your comments [15:33:54] (03CR) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [15:34:19] 6operations: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1451943 (10Dzahn) 3NEW [15:35:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:35:15] (03CR) 10Muehlenhoff: "nfs::netapp::common only activates an NFS mount on helium, this is probably from earlier work (the comment mentions a temporary workaround" [puppet] - 10https://gerrit.wikimedia.org/r/224576 (owner: 10Muehlenhoff) [15:36:44] _joe_: thanks, I'll look shortly [15:39:40] (03CR) 10Dzahn: "yes, needs https://phabricator.wikimedia.org/T105804" [puppet] - 10https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) 
(owner: 10Dzahn) [15:40:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:45:00] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 198 seconds ago with 0 failures [15:47:21] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 805.522671044 [15:50:56] (03PS1) 10Ori.livneh: Expect l10n_cache-en.php, not l10n_cache-en.cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/224629 [15:51:21] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2524.33531273 [15:52:29] bd808: ^ [15:54:14] (03CR) 10BryanDavis: [C: 031] "We need to make beta cluster use this" (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/224629 (owner: 10Ori.livneh) [15:58:26] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, however it is critical enough that we ought to have tests" [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [16:00:28] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1452007 (10BBlack) Added a few CCs that seem like authors of bots in the UA logs... [16:05:01] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:06:37] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1452021 (10Cyberpower678) My framework's at the top of the list. :p [16:08:33] Warning: mwstore://local-swift-eqiad/local-thumb/6/61/AudiRS6Motor.jpg/120px-AudiRS6Motor.jpg was not stored with SHA-1 metadata. 
in /srv/mediawiki/php-1.26wmf13/includes/filebackend/SwiftFileBackend.php on line 670 [16:08:45] lots of messages like that in the hhvm log stream [16:09:02] Easily findable at https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm with a search for "swift" [16:10:00] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:10:28] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1452027 (10Dzahn) [16:10:47] looks like it is benign. Just an "oops I need to fix up this object" warning [16:10:53] gwicke: https://wikitech.wikimedia.org/wiki/Deployments?veaction=edit just hangs :/ [16:11:15] that page and VE don't get along [16:11:31] if you do get it to load you really won't be able to edit it properly [16:11:40] too much template magic :/ [16:11:59] I *think* this is actually still going through PHP and not RESTBase [16:12:31] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1452037 (10BBlack) [16:12:39] what is the wikitech db name again? 
[16:12:46] labsdb [16:12:55] labswiki I mean [16:13:02] http://parsoid-lb.eqiad.wikimedia.org/labswiki/Deployments loads okay [16:13:26] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [16:14:09] Uncaught TypeError: Cannot read property '1' of null [16:14:30] I wonder if it's a client-side issue [16:15:00] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 137 seconds ago with 0 failures [16:16:02] <_joe_> I didn't think VE was working on wikitech [16:16:16] https://wikitech.wikimedia.org/wiki/Cassandra?veaction=edit doesn't show the same JS error in the browser error console [16:16:32] and loads okay [16:17:54] the JS error happens at https://wikitech.wikimedia.org/wiki/Deployments too [16:18:15] so possibly an unrelated JS error that then breaks VE [16:19:23] yeah, it's in some gadget-y code related to the deployment calendar [16:19:27] greg-g: ^^ [16:20:38] .siblings('.deploycal-time-sf').children().attr('datetime'));sfTz=(sfTz[1]*60+sfTz[2]*1)/60; [16:22:04] from https://wikitech.wikimedia.org/wiki/MediaWiki:Common.js [16:22:19] Krinkle_: you around? 
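[editor's note] The Common.js fragment quoted at 16:20:38 indexes into `sfTz` without checking it first, which fits the "Cannot read property '1' of null" error reported at 16:14:09 if `sfTz` is the result of a regex match that found nothing. A minimal sketch of a guarded version follows. The function name, the regex, and the assumed `datetime` format (an ISO-style string ending in an offset such as `-07:00`) are illustrative assumptions, not taken from the actual Common.js.

```javascript
// Hypothetical helper: name, regex and input format are assumptions for
// illustration; the real code in MediaWiki:Common.js is not shown in the log.
function parseTzOffsetHours(datetime) {
  // String(...) also covers the case where the attribute is undefined.
  var m = /([+-])(\d{2}):?(\d{2})$/.exec(String(datetime));
  if (m === null) {
    // No trailing offset found: report that instead of throwing on m[1].
    return null;
  }
  var sign = m[1] === '-' ? -1 : 1;
  // Convert hours and minutes to a fractional hour offset.
  return sign * (Number(m[2]) * 60 + Number(m[3])) / 60;
}
```

A caller would then skip the timezone adjustment when the helper returns `null`, so a missing or oddly formatted attribute no longer aborts the rest of the page's scripts.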
[16:23:49] !log bromine - apt-get upgrade [16:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:01] if anybody has edit rights for https://wikitech.wikimedia.org/wiki/MediaWiki:Common.js, the sfTz match might be null [16:27:19] (03PS4) 10Dzahn: add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) [16:28:34] 10Ops-Access-Requests, 6operations, 6Discovery, 10Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1452062 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/223984/ [16:29:33] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1452063 (10Dzahn) https://meta.wikimedia.org/w/index.php?title=Tech/News/2015/30&diff=prev&oldid=12703397 tagged the ticket "user-notice" [16:31:39] (03CR) 10Smalyshev: [C: 031] add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [16:33:16] (03PS2) 10Dzahn: Add ferm rules for statistics-web [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:33:30] gwicke: what's the issue? the calendar loads ok for me, but will it break if I edit it? 
[16:35:40] (03CR) 10Dzahn: [C: 04-1] "i think we can also close 443 - i don't think that Apache should be listening on it anymore - all services have been moved to behind misc-" [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:36:34] (03CR) 10Dzahn: "the task would be to cleanup stat1001 though:" [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:39:34] greg-g: the JS code throws an exception (see browser console), which breaks VE [16:39:56] https://wikitech.wikimedia.org/wiki/MediaWiki_talk:Common.js [16:41:06] this is in Chrome, but I doubt that makes a difference [16:41:21] (03PS1) 10BBlack: Run update-ca-certificates on CA removal as well [puppet] - 10https://gerrit.wikimedia.org/r/224639 [16:42:57] (03PS1) 10Dzahn: statistics: don't let Apache listen on 443 anymore [puppet] - 10https://gerrit.wikimedia.org/r/224642 [16:43:03] gwicke: /me nods [16:44:08] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/224642/1" [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:45:12] (03CR) 10Dzahn: [C: 031] "if it also gets a +1 from Giuseppe i would say can be merged anytime, the group is not applied on anything yet. 
it will be once we install" [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [16:47:19] (03CR) 10BBlack: [C: 032] Run update-ca-certificates on CA removal as well [puppet] - 10https://gerrit.wikimedia.org/r/224639 (owner: 10BBlack) [16:47:51] (03PS3) 10Dzahn: Add ferm rules for statistics-web [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:50:16] (03PS4) 10Dzahn: Add ferm rules for statistics-web [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:50:40] (03CR) 10Dzahn: [C: 031] "moved from module to role class and removed 443" [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:51:27] (03PS5) 10Dzahn: Add ferm rules for statistics-web [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:52:05] (03CR) 10Dzahn: [C: 032] "is needed either way and will be noop until I09f134915999bf8145" [puppet] - 10https://gerrit.wikimedia.org/r/224587 (owner: 10Muehlenhoff) [16:52:43] (03CR) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [16:54:12] (03PS10) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [16:54:25] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1452104 (10RobH) So the symlink in /etc/ssl/RapidSSL_CA.pem was leftover, as update-ca-certificates didn't fire off (via config/script) after file removal. Brandon fixed the config to fire an update: https:/... [16:54:32] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1452105 (10Rillke) Thanks, I updated the defaults of the nodejs library yesterday. 
Maybe still forcing http somewhere; is this list from today? I thought the connection between WMFLabs and... [16:54:35] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1452106 (10RobH) 5stalled>3Resolved [16:54:52] (03CR) 10jenkins-bot: [V: 04-1] service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [16:57:55] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1452113 (10GWicke) I have done some testing with ipv6 docker containers on hosts with a /64 subnet assigned to them: https://github.com/wikimedia/ansible-deploy/tree... [16:58:35] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1452125 (10GWicke) [17:02:26] (03CR) 10Dzahn: [C: 032] "tested:" [puppet] - 10https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) (owner: 10Filippo Giunchedi) [17:02:37] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1452132 (10Rillke) > Redirecting POST traffic doesn't actually secure it if the clients don't remember to keep using HTTPS afterwards anyways Agreed. At least my bot would happily continue e... 
[17:02:44] (03PS4) 10Dzahn: monitoring: replace illegal chars in description [puppet] - 10https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) (owner: 10Filippo Giunchedi) [17:03:55] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1452134 (10GWicke) [17:10:38] !log es1.6 step 10: upgrade elastic1009 [17:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:13:49] 6operations, 7Icinga, 5Patch-For-Review: monitoring::service must fail on invalid service descriptions - https://phabricator.wikimedia.org/T101799#1452159 (10Dzahn) tested the regex / puppet code with some random description , merged and babysat on neon. i think we can close it then [17:17:54] 6operations, 7Icinga, 5Patch-For-Review: monitoring::service must fail on invalid service descriptions - https://phabricator.wikimedia.org/T101799#1452165 (10Dzahn) a:3fgiunchedi [17:18:13] 6operations, 7Icinga, 5Patch-For-Review: monitoring::service must fail on invalid service descriptions - https://phabricator.wikimedia.org/T101799#1452166 (10Dzahn) [17:18:25] 6operations, 7Icinga, 5Patch-For-Review: monitoring::service must fail on invalid service descriptions - https://phabricator.wikimedia.org/T101799#1452167 (10fgiunchedi) 5Open>3Resolved yes we can! [17:19:18] 7Puppet, 6operations, 10Continuous-Integration-Infrastructure, 6Labs: Error "Duplicate declaration: File[/etc/ssh/userkeys] is already declared in file /private/modules/passwords/manifests/init.pp:36; cannot redeclare at /modules/ssh/manifests/server.pp:31" - https://phabricator.wikimedia.org/T92752#1452173... 
[17:19:22] 7Puppet, 6Labs, 3Labs-Sprint-106: puppetmaster::gitsync should update labs/private repository as well - https://phabricator.wikimedia.org/T92756#1452170 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Resolved, me thinks [17:25:26] (03PS6) 10Smalyshev: Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 [17:26:07] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1452192 (10GWicke) Debian packages at https://deb.nodesource.com/iojs_2.x/pool/main/i/iojs/ [17:26:08] (03CR) 10jenkins-bot: [V: 04-1] Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 (owner: 10Smalyshev) [17:26:19] (03CR) 10Dzahn: "see ticket - it says this is a good use case for a VM - instead" [puppet] - 10https://gerrit.wikimedia.org/r/224554 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn) [17:26:24] (03Abandoned) 10Dzahn: move grafana from zirconium to netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/224554 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn) [17:27:09] (03PS7) 10Smalyshev: Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 [17:39:41] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1452206 (10GWicke) On IRC, @MoritzMuehlenhoff proposed to wait until LTS comes out in October. We could do that, but it would be nice to start gradually migrating to modern V8 sooner... 
[17:41:35] !log terbium: /usr/local/bin/foreachwiki extensions/Echo/maintenance/processEchoEmailBatch.php [17:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:25] (03PS7) 10Dzahn: Remove dependency on echowikis.dblist [puppet] - 10https://gerrit.wikimedia.org/r/139581 (https://phabricator.wikimedia.org/T59375) (owner: 10Withoutaname) [17:45:18] (03PS8) 10Dzahn: Remove dependency on echowikis.dblist [puppet] - 10https://gerrit.wikimedia.org/r/139581 (https://phabricator.wikimedia.org/T59375) (owner: 10Withoutaname) [17:45:37] (03CR) 10Dzahn: [C: 032] "confirmed and ran it on terbium - rebased" [puppet] - 10https://gerrit.wikimedia.org/r/139581 (https://phabricator.wikimedia.org/T59375) (owner: 10Withoutaname) [17:48:49] (03CR) 10Dzahn: "hey Reedy, should we leave this untouched until we have general response about wiki renaming and external storage?" [puppet] - 10https://gerrit.wikimedia.org/r/169944 (https://bugzilla.wikimedia.org/39482) (owner: 10Reedy) [17:50:10] 6operations, 10Wikimedia-Site-requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1452225 (10Dzahn) What about the pending Apache config change to redirect it? Is it moot until we figured out the issues with ES? https://gerrit.wikimedia.org/r/#/c/1699... [17:56:24] (03CR) 10Dzahn: [C: 04-1] "wouldn't this set the puppetmaster to "labs-puppetmaster-eqiad.wikimedia.org" for all production hosts too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/224465 (owner: 10Andrew Bogott) [17:56:43] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds [17:57:13] PROBLEM - Apache HTTP on mw1157 is CRITICAL - Socket timeout after 10 seconds [17:57:17] (03CR) 10Dzahn: "..since i see it being set to that in eqiad.yaml and the default is removed in base/manifests/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/224465 (owner: 10Andrew Bogott) [17:58:34] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.240 second response time [17:58:48] ^ that was apparently because it is doing this right now: [17:58:53] Notice: /Stage[main]/Mediawiki::Multimedia/File[/tmp/gs_TymQXv]/ensure: removed [17:59:03] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.323 second response time [17:59:03] Mediawiki::Multimedia/Tidy[/tmp]: Tidying File ... [18:00:05] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150714T1800). Please do the needful. [18:00:27] (03CR) 10Andrew Bogott: "You are totally right. I will rethink this :)" [puppet] - 10https://gerrit.wikimedia.org/r/224465 (owner: 10Andrew Bogott) [18:14:23] PROBLEM - Apache HTTP on mw1157 is CRITICAL - Socket timeout after 10 seconds [18:16:05] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.360 second response time [18:26:52] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1452294 (10BBlack) >>! In T105794#1452105, @Rillke wrote: > Thanks, I updated the defaults of the nodejs library yesterday. Maybe still forcing http somewhere; is this list from today? Yeah... 
[18:28:01] operations, Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1452296 (RobH) It appears that we are now in the steps of relocating two of the three systems into different racks, correct? We'll have @cmjohnson then remove the larger capacity disks and relocate two... [18:28:40] mutante: I accidentally reverted your edit on meta while trying to thank you for it; don't be surprised by the echo notification :) [18:29:00] guillom: heh, ok:) [18:29:24] (PS1) BBlack: varnish: default dynamic_directors true (changes eqiad) [puppet] - https://gerrit.wikimedia.org/r/224649 (https://phabricator.wikimedia.org/T97029) [18:37:41] thcipriani: Do you know when the MW-core wmf14 cut will happen? RoanKattouw and I are waiting for it before merging something, and our plane leaves quite soon… ;-) [18:38:44] PROBLEM - puppet last run on db2012 is CRITICAL puppet fail [18:39:12] James_F: I'm waiting for a plane too. twentyafterfour would probably know. [18:39:21] thcipriani: Ha, OK. :-) [18:40:27] lol [18:42:30] plot twist: Reedy is the pilot of that plane [18:42:45] Which one? [18:42:53] All of them? Simultaneously? [18:42:53] the one thcipriani is waiting for [18:43:07] oh, heh [18:43:22] Reedy: an entirely new question.. what about wiki renaming? hahah [18:43:41] Did the DBAs ever sort out ES? [18:43:55] hehe, i think no [18:44:13] (PS1) Manybubbles: Disable dynamic scripting in Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224651 [18:44:18] i just saw your pending Apache change again [18:44:24] and noticed the task is from 2012 [18:44:28] lol [18:44:33] I think we have some older ones [18:44:53] (CR) Manybubbles: [C: -1] "No mergies until I5aca0d6a0a72ff674603d00fb6b9519b1d73a0eb is deployed." [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles) [18:45:45] mutante: Well there are "only" 12 WMF staff on my flight...
[18:46:15] Reedy: so i'm still at my comment from 2013 "talked to Asher, it is possible but not easy" [18:46:35] make the wiki readonly [18:46:36] dump the db [18:46:38] reimport [18:46:40] ??? [18:46:41] profit [18:47:00] RoanKattouw: ooh ohh, i thought there is all this organizing to avoid that and that's why we cant just book ourselves [18:47:13] Ther eis [18:47:16] I think 12 is the maximum [18:47:18] Are they all "important" staff? [18:47:19] * Reedy coughs [18:47:40] 12 out of 555 on an A380 isn't many [18:47:42] :D [18:47:48] Nope, I'm on it for instance. [18:47:54] Sadly it's a 738. [18:47:55] PROBLEM - puppet last run on mw2080 is CRITICAL puppet fail [18:48:26] heh [18:48:49] whoever makes wikimania videos this year.. let's wait for the ones from next year first [18:48:54] eh, last [18:49:15] mutante: are they still not done? [18:49:20] lol. [18:49:22] Np [18:49:23] No [18:49:39] Many are too big to go into swift [18:49:55] reduce quality until it fits:) [18:49:57] Reedy: make swift bigger? :p [18:50:27] mutante: they'll probably end up as 1 pixel videos then :) [18:50:52] https://phabricator.wikimedia.org/T84459 [18:51:50] You do not have permission to view this object. 
- :( [18:52:34] SPF|Cloud: you never do, you're used to it now [18:52:47] "most files failed with i/o errors, broken source " [18:52:53] I see that as an insult :p [18:54:23] Bah [18:55:43] if i'd open it up i'd have comments about it being ok or not ok that there are full email footers in there with street address / phone numbers [18:55:47] but there is no other reason [18:56:49] Okay, no problem anyway, gotta deal with it it's nda-only so [18:58:09] it's an imported RT ticket, i'd like to open up more but every single one is a new case and the email footers are often the only reason to hesitate [18:58:27] bugs me too [19:02:14] (PS8) Smalyshev: Add definitions for WDQS service [puppet] - https://gerrit.wikimedia.org/r/223663 [19:02:54] (CR) jenkins-bot: [V: -1] Add definitions for WDQS service [puppet] - https://gerrit.wikimedia.org/r/223663 (owner: Smalyshev) [19:03:31] (CR) Dzahn: "still blocked on ES - i'd like to remove myself and get back to this once that blocker is done - since the linked ticket is 3 years old an" [puppet] - https://gerrit.wikimedia.org/r/169944 (https://bugzilla.wikimedia.org/39482) (owner: Reedy) [19:03:53] (PS9) Smalyshev: Add definitions for WDQS service [puppet] - https://gerrit.wikimedia.org/r/223663 [19:05:14] RECOVERY - puppet last run on db2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:05:22] (CR) Dzahn: "bd808 - are you the scap man who wants to accept this? (to the tune of https://www.youtube.com/watch?v=y6oXW_YiV6g)" [puppet] - https://gerrit.wikimedia.org/r/214037 (owner: Alex Monk) [19:06:34] James_F: branch cut just happened, I'm about to commit it [19:06:42] twentyafterfour: Awesome, thanks. [19:07:38] James_F: all pushed [19:07:53] (CR) Dzahn: [C: 1] Enable packet filter for heze [puppet] - https://gerrit.wikimedia.org/r/224576 (owner: Muehlenhoff) [19:07:53] twentyafterfour: Thanks.
[19:12:27] operations, Traffic, HTTPS, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1452350 (TheDJ) Left a comment at: https://github.com/mwclient/mwclient/issues/70#issuecomment-121345327 [19:16:33] RECOVERY - puppet last run on mw2080 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:21] !log es1.6 step iforget: upgrade elasticsearch on elastic1010 [19:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:50] What am I going to miss out on by disabling flash in gerrit? [19:28:33] ah, the answer is: those little ‘copy link to pasteboard’ graphics [19:35:43] RECOVERY - check_mysql on payments2001 is OK: Uptime: 3760 Threads: 3 Questions: 4672 Slow queries: 3 Opens: 50 Flush tables: 1 Open tables: 43 Queries per second avg: 1.242 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [19:35:47] i pull those with git review anyhow [19:35:51] so sounds like you miss nothing. [19:38:25] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1452388 (Pcoombe) I'm fine with dropping the www.* subdomains if they're going to cause extra work/cost. Couldn't find anywhere that... [19:40:27] (PS3) Andrew Bogott: Use hiera for the puppetmaster name, everywhere. [puppet] - https://gerrit.wikimedia.org/r/224465 [19:40:29] (PS8) Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - https://gerrit.wikimedia.org/r/33066 (owner: Faidon Liambotis) [19:40:31] (PS1) Andrew Bogott: Purge labs_puppet_master_secondary. [puppet] - https://gerrit.wikimedia.org/r/224660 [19:40:46] mutante: I don’t trust myself anymore, but is that patchset less breaky? [19:44:16] oops, one small mistake [19:44:28] (PS4) Andrew Bogott: Use hiera for the puppetmaster name, everywhere.
[puppet] - https://gerrit.wikimedia.org/r/224465 [19:44:30] (PS9) Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - https://gerrit.wikimedia.org/r/33066 (owner: Faidon Liambotis) [19:44:32] (PS2) Andrew Bogott: Purge labs_puppet_master_secondary. [puppet] - https://gerrit.wikimedia.org/r/224660 [19:44:38] (PS1) 20after4: 1.26wmf14 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/224663 [19:46:46] !log twentyafterfour Started scap: testwiki to 1.26wmf14 and rebuild localization cache [19:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:47:13] (CR) 20after4: [C: 2] 1.26wmf14 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/224663 (owner: 20after4) [19:47:24] (Merged) jenkins-bot: 1.26wmf14 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/224663 (owner: 20after4) [19:54:20] operations: detail now many XFP/SFP+ tranceivers are needed per peering site - https://phabricator.wikimedia.org/T105827#1452429 (RobH) NEW a:faidon [19:55:23] RECOVERY - check_mysql on payments2002 is OK: Uptime: 4715 Threads: 1 Questions: 2319 Slow queries: 0 Opens: 36 Flush tables: 1 Open tables: 29 Queries per second avg: 0.491 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [19:55:28] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1452436 (RobH) @cmjohnson: Done, bring them with you. I ordered the other fibers needed and papaul will include them in the shipment.
[20:04:34] RECOVERY - check_mysql on payments2003 is OK: Uptime: 4427 Threads: 1 Questions: 2359 Slow queries: 0 Opens: 36 Flush tables: 1 Open tables: 29 Queries per second avg: 0.532 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [20:28:59] operations, Graphite, Monitoring: deprecate gdash - https://phabricator.wikimedia.org/T104365#1452514 (faidon) I also regularly use reqsum (neither these nor reqerror aren't "Varnish" stats, by the way; there is a long pipeline to log these and right now it does originate in Varnish but there's no real... [20:31:19] (CR) Giuseppe Lavagetto: [C: 1] varnish: default dynamic_directors true (changes eqiad) [puppet] - https://gerrit.wikimedia.org/r/224649 (https://phabricator.wikimedia.org/T97029) (owner: BBlack) [20:31:32] <_joe|AFK> bblack: served! [20:34:54] PROBLEM - puppet last run on mw2132 is CRITICAL Puppet has 1 failures [20:35:04] PROBLEM - Apache HTTP on mw1158 is CRITICAL - Socket timeout after 10 seconds [20:35:08] (CR) BBlack: [C: 2] varnish: default dynamic_directors true (changes eqiad) [puppet] - https://gerrit.wikimedia.org/r/224649 (https://phabricator.wikimedia.org/T97029) (owner: BBlack) [20:36:54] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.360 second response time [20:40:13] PROBLEM - HHVM rendering on mw1146 is CRITICAL - Socket timeout after 10 seconds [20:40:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [20:41:54] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 65504 bytes in 0.905 second response time [20:42:54] !log undoing LCStoreStaticArray because appservers look unhealthy, using ori's command: 'salt -G deployment_target:scap/scap cmd.run "rm /etc/lcstore"' [20:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:43:13] PROBLEM - HHVM rendering on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1
503 Service Unavailable - 50350 bytes in 0.008 second response time [20:44:14] PROBLEM - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.017 second response time [20:45:37] (PS1) BBlack: Revert "Use LCStoreStaticArray unconditionally" [mediawiki-config] - https://gerrit.wikimedia.org/r/224720 [20:45:44] (PS2) BBlack: Revert "Use LCStoreStaticArray unconditionally" [mediawiki-config] - https://gerrit.wikimedia.org/r/224720 [20:46:10] (CR) BBlack: [C: 2] Revert "Use LCStoreStaticArray unconditionally" [mediawiki-config] - https://gerrit.wikimedia.org/r/224720 (owner: BBlack) [20:46:43] PROBLEM - HHVM processes on mw1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:47:14] PROBLEM - Apache HTTP on mw1124 is CRITICAL - Socket timeout after 10 seconds [20:48:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [20:48:44] twentyafterfour: ping? [20:48:45] twentyaf pts/5 2620:0:861:2:7a2 17:56 50:31 10.53s 9.92s python /usr/local/bin/scap testwiki to 1.26wmf14 and rebuild localization cache [20:49:04] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.262 second response time [20:49:34] PROBLEM - Apache HTTP on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.166 second response time [20:50:23] RECOVERY - check_payments_wiki on payments2001 is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.009 second response time [20:51:24] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [20:53:06] scap is at 80% [20:53:52] so LCStoreStaticArray is slow?
[20:53:53] RECOVERY - check_ipn_redir on mintaka is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.012 second response time [20:53:55] PROBLEM - HHVM rendering on mw1148 is CRITICAL - Socket timeout after 10 seconds [20:54:33] RECOVERY - check_payments_wiki on payments2003 is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.010 second response time [20:54:37] twentyafterfour: it's OOMing -> killing appservers [20:54:54] PROBLEM - HHVM rendering on mw1140 is CRITICAL - Socket timeout after 10 seconds [20:55:17] lovely [20:55:19] <_joe|AFK> bblack: it's OOMing because at every release [20:55:23] RECOVERY - check_payments_wiki on payments2002 is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.010 second response time [20:55:25] <_joe|AFK> we load a ton new static strings [20:55:32] <_joe|AFK> without freeing the old ones [20:55:37] <_joe|AFK> that's what happened [20:55:44] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 65505 bytes in 1.504 second response time [20:55:45] PROBLEM - HHVM rendering on mw1119 is CRITICAL - Socket timeout after 10 seconds [20:55:52] <_joe|AFK> looking at the memory graph [20:55:58] <_joe|AFK> so scap is killing the appservers [20:56:01] I really think it's stupid that we branch a whole new branch every week instead of just merging in changes [20:56:07] <_joe|AFK> how nice is that? [20:56:29] my scap is just syncing the new branch without activating it other than testwiki [20:56:41] and scap is done [20:56:44] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 65504 bytes in 1.282 second response time [20:56:56] (well... 3 left)... 
[20:57:13] PROBLEM - HHVM rendering on mw1116 is CRITICAL - Socket timeout after 10 seconds [20:57:33] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK [20:57:33] PROBLEM - check_listener_ipn on thulium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK [20:57:35] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 65504 bytes in 1.340 second response time [20:58:03] one of the syncs failed for a read-only filesystem [20:58:15] mw1216 [20:59:04] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 65504 bytes in 0.134 second response time [20:59:04] operations, ops-codfw: payments2002 looks like it has a failed disk - https://phabricator.wikimedia.org/T105833#1452588 (Jgreen) NEW [20:59:32] !log twentyafterfour Finished scap: testwiki to 1.26wmf14 and rebuild localization cache (duration: 72m 45s) [20:59:34] mw1090 [20:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:43] ACKNOWLEDGEMENT - check_raid on payments2002 is CRITICAL HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] Jeff_Green ticket submitted T105833 [20:59:57] mw1090 has a readonly filesystem? [21:00:06] twentyafterfour: does that mean I can fetch and sync-file now? [21:01:05] <_joe|AFK> twentyafterfour: what tells you that? [21:01:52] !log twentyafterfour Synchronized wmf-config/CommonSettings.php: revert LCStoreStaticArray (duration: 00m 12s) [21:01:53] RECOVERY - puppet last run on mw2132 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:00] oh ok [21:02:03] twentyafterfour: thanks!
[21:02:24] I got the readonly filesystem errors from scap [21:02:33] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK [21:02:33] PROBLEM - check_listener_ipn on thulium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK [21:03:02] 21:01:42 sync-common failed: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', '--include=/wmf-config', '--include=/wmf-config/CommonSettings.php', [21:03:04] '--exclude=*', 'mw1216.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 12 [21:03:39] <_joe|AFK> twentyafterfour: so mw1216 returned an error? [21:04:32] _joe|AFK: well, also this: 21:01:42 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/CommonSettings.php', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', [21:04:34] 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on mw1090.eqiad.wmnet returned [70]: 21:01:42 Copying to mw1090.eqiad.wmnet from mw1216.eqiad.wmnet [21:04:48] so I think it' [21:04:56] <_joe|AFK> twentyafterfour: you're right about mw1090 [21:05:08] yeah 1216 was just the proxy node [21:05:11] <_joe|AFK> depooling it [21:05:16] maybe it's mounted with errors=remount-ro and had some hwfail? 
[21:05:22] <_joe|AFK> bblack: yes [21:05:33] RECOVERY - check_puppetrun on saiph is OK Puppet is currently enabled, last run 206 seconds ago with 0 failures [21:05:48] <_joe|AFK> !log depooling mw1090, ext4 errors in syslog, filesystem mounted read-only [21:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:19] is the OOM condition fixed? I don't see a bunch of recovery notifications from icinga... [21:07:03] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 0.010 second response time [21:07:03] RECOVERY - check_listener_ipn on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.057 second response time [21:07:51] <_joe|AFK> twentyafterfour: it apparently is [21:08:02] <_joe|AFK> twentyafterfour: we still need a rolling restart probably [21:08:12] <_joe|AFK> but I'll leave that to bblack honestly [21:08:15] <_joe|AFK> too tired [21:08:36] <_joe|AFK> and now I have to open a ticket [21:09:14] bblack: did you rm /etc/lcstore? [21:09:27] twentyafterfour: yes [21:10:16] _joe|AFK: advise on parallelism/delay for rolling restart? [21:11:08] yeah I haven't done a scap hhvm restart before, `scap --restart` is all I need to do? [21:11:26] operations, ops-eqiad: mw1090 has a read-only filesystem - https://phabricator.wikimedia.org/T105835#1452614 (Joe) NEW [21:11:40] <_joe|AFK> twentyafterfour: don't [21:11:56] what's too aggressive? [21:12:20] 1-by-1, 3s sleep after? 1-by-1, no sleep?
[21:12:37] <_joe|AFK> it is [21:12:51] <_joe|AFK> do 5 at a time, 30s between them :) [21:12:54] ok [21:13:14] <_joe|AFK> I know it's a long time, but I'd wait before restarting [21:13:29] <_joe|AFK> nah, it's needed [21:13:43] RECOVERY - check_listener_gc on saiph is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.011 second response time [21:13:43] RECOVERY - check_listener_ipn on saiph is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.012 second response time [21:13:43] salt -t 60 -v -b 5 -G deployment_target:scap/scap cmd.run "service hhvm restart; sleep 30" [21:13:44] <_joe|AFK> https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=mem_report&s=by+name&c=API%2520application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false [21:13:46] ^ ok? [21:14:00] <_joe|AFK> bblack: seems ok [21:14:06] <_joe|AFK> bblack: nope [21:14:23] <_joe|AFK> you actually need to target appservers and api appservers [21:14:31] (target is from ori's lcstore rm command) [21:14:47] <_joe|AFK> so -G cluster:appserver first, then -G cluster:appserver_api IIRC [21:14:55] <_joe|AFK> and then jobrunner [21:14:58] can I do both together with -C? [21:15:02] or all 3? [21:15:03] <_joe|AFK> because they're the ones running hhvm [21:15:05] <_joe|AFK> yes [21:15:08] ok [21:15:08] <_joe|AFK> as you prefer [21:15:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:16:20] <_joe|AFK> half of the imagescalers too, but they seem better off [21:16:49] <_joe|AFK> https://gdash.wikimedia.org/dashboards/reqerror/ tells you we'd need an incident report, I guess [21:17:12] <_joe|AFK> sigh, the outages keep happening after my supposed bedtime [21:17:40] the servers have learned your schedule [21:18:01] thank ori ;) [21:18:41] I actually am not sure why this happened since I was just syncing the branch and not pointing any load to it. hhvm bug? 
[21:18:55] hhvm "feature" :) [21:19:09] <_joe|AFK> twentyafterfour: no the fact is that patch loads the cdb data in memory as static arrays [21:19:18] <_joe|AFK> they won't ever be removed by design [21:19:37] <_joe|AFK> so I guess any update will just add more static strings [21:19:44] <_joe|AFK> until the server dies [21:19:57] <_joe|AFK> so, we did actively shoot ourselves in the foot here [21:20:20] hmm, so they never get garbage collected in any way? [21:20:28] <_joe|AFK> can't blame HHVM for not removing statically allocated memory, right [21:20:50] <_joe|AFK> twentyafterfour: that was my understanding from the start, yes [21:21:18] <_joe|AFK> but I might have lost some detail, I didn't really look into the implementation [21:21:23] yeah [21:21:29] I'll look into it further [21:22:25] hi ops, sutff's broken [21:22:28] https://test.wikipedia.org/wiki/ [21:22:31] [b12dbfc5] /wiki/ MWException from line 469 of /srv/mediawiki/php-1.26wmf14/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [21:24:07] yup [21:24:18] we're restarting hhvm in prod [21:24:38] which backend is testwiki again? [21:24:49] is it labs or one of the mw*'s? [21:25:05] <_joe|AFK> mw1017 [21:25:27] <_joe|AFK> bblack: it's an exception [21:25:32] <_joe|AFK> and it's bad [21:25:49] <_joe|AFK> lemme see what's happening on pybal for the normal servers [21:26:04] so, disable lcstore -> break all the things? [21:26:35] testwiki was the only one I pointed to wmf14, should I roll that back to wmf13? [21:26:52] maybe [21:27:16] <_joe|AFK> I think it's just testwiki AFAICS [21:27:37] yeah [21:27:44] did the lcstore stuff also make it so that new localization pushes only went to the static array stuff and not to the existing stuff? [21:27:59] hence enable lcstore -> update l10n -> disable lcstore -> no l10n? 
[21:28:02] the localization cache for wmf14 was built in the LCStoreStaticArray format [21:28:21] bblack: yeah [21:28:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Fix testwiki [21:28:27] <_joe|AFK> twentyafterfour: this means we can't release it? [21:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:50] <_joe|AFK> Reedy: I doubt that will help [21:28:59] I'm pretty sure it did [21:29:01] Look [21:29:02] It works [21:29:06] I only typed sync-wikiversions [21:29:17] <_joe|AFK> oh so we just wiped out those files? [21:29:18] <_joe|AFK> wtf? [21:29:23] This is a test of release of MediaWiki 1.26wmf13 (ab8da29) [21:29:33] it works, but it's wmf13 [21:29:46] wmf14 has the new lccache format [21:29:46] the stack trace was complaining about wmf14. staging on tin had wmf13 for testwiki [21:29:46] operations, RESTBase, Monitoring, Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1452675 (GWicke) [21:29:58] !log mw1090 fs is ro [21:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:30:30] so yeah, maybe we avoid wmf14 till we get to the bottom of this pile [21:30:33] Reedy: _joe already filed a ticket for mw1090 [21:30:34] <_joe|AFK> Reedy: I already depooled it and opened a ticket [21:30:44] ok [21:30:52] bblack: yeah I wasn't planning to push it out any further [21:31:00] wmf14 wikis should still just be using the same code and config [21:31:11] Unless something is borked in the chooser code [21:31:17] I think I can rebuild the localization cache now that we reverted mwconfig but I'm not sure I gotta look at the changes to scap and see how that works [21:31:48] if ( file_exists( '/etc/lcstore' ) && file_exists( "$IP/cache/l10n/zu.l10n.php" ) ) { [21:31:48] $wgLocalisationCacheConf['storeClass'] = 'LCStoreStaticArray'; [21:31:48] } [21:31:54] Reedy: there
were corresponding scap changes [21:32:12] scap didn't generate the old format cache for wmf14 [21:32:16] afaik [21:32:45] Yup [21:32:48] Looks to be the case [21:32:56] <_joe|AFK> that was quick. Congratulations everyoen. [21:32:59] * twentyafterfour goes to look at the scap diffs [21:32:59] wmf14 only has php stuff [21:33:17] * greg-g smiles at Reedy [21:33:18] But that if shoudl work for said version... As long as they were copied out [21:33:19] _joe: go get some rest? :) [21:33:37] Which, presumably they were, as it wouldn't be overriding the config otherwise [21:33:44] reedy: the php stuff is horribly broken [21:33:45] is testwiki just on one box now? [21:33:53] now/still [21:33:57] read the backlog [21:34:24] "did you test this before you committed it?" [21:34:25] :) [21:34:56] no but bd808 predicted it [21:35:08] heh [21:35:10] talk to ori [21:35:28] he's the one who thought it a good idea to +2 this and then jump on a plane [21:35:45] "wtf, why not, I'll be in mexico!" [21:36:00] Do you have the scap diff to hand? [21:36:16] oh, looking at the wrong repo [21:39:08] yeah scap changes seem minor [21:40:22] I guess the wmf-config changes probably trigger different behavior from l10nupdate [21:40:57] It looks vaguely correct [21:41:11] Oh [21:41:12] I wonder [21:41:33] If there is the PHP cache for said version of the config... [21:41:49] "php" (read hhvm or wharever) [21:41:57] And therefore it doesn't know that it's changed for said wiki [21:42:29] APC was known to cause weird shit like that [21:42:40] maybe just push another minor l10n update now that wmf-config is disbled? [21:42:48] Do we need to? 
[21:43:01] s/wmf-config/lcstaticarraywhatever in wmf-config/ [21:43:15] I think we should update testwiki config again, and then just restart hhvm [21:43:30] we're still going through the first rolling restart, to get it disabled [21:43:43] Reedy: we don't want to turn this thing back on [21:44:08] Ah [21:44:10] just need to run l10nupdate again to generate the cdb cache [21:44:11] where is the value for this: [21:44:13] $role::cache::configuration::backends[$::realm]['test_appservers'][$::mw_primary] [21:44:16] I think [21:44:17] Well, yeah, comment out the wmf-config stuff [21:44:21] in other words: which server is test.wp [21:44:26] Did l10nupdate get updated/hacked? [21:44:41] Reedy: that's what I'm trying to track down [21:44:44] twentyafterfour: I think... Delete the php files. Remove/comment out oris hack. Re-run scap [21:44:51] test.wp is mw1017 [21:45:06] In theory, you could force the php l10n cache to be built as such, by first touching zu.l10n.php [21:45:09] bblack: thanks [21:45:11] That'd override the localisation cache onfig [21:45:17] Reedy: there you go, mw1017 [21:45:19] Which would then mean the php files were generated [21:45:38] Reedy: that's what we don't want [21:45:45] twentyafterfour: Precisely [21:45:47] So like I say [21:45:50] Comment out the 3 lines in wmf-config [21:45:52] Delete the php files [21:46:00] Run scap, with that version enabled for testwiki [21:46:02] Profit [21:46:09] right [21:46:37] https://github.com/wikimedia/operations-puppet/tree/production/modules/scap [21:46:48] The l10nupdate code hasn't been touched (at least, not comitted) [21:47:02] So, it's presumably working on some fragile hack based on the existence of that file [21:47:27] the /etc/lcstore file was killed everywhere, before any of the reverts/restarts, if that's what you mean [21:47:54] I initially thought that would be enough to stop it, until I saw that the code to conditionalize on that file had been removed in a later commit, so I reverted the commit 
that removed the conditional. [21:48:59] Primarily, $wgLocalisationCacheConf['storeClass'] = 'LCStoreStaticArray'; needs removing [21:49:03] // [21:49:15] so it doesn't even try and use that l10ncache type [21:51:56] the initial rolling restart finished [21:52:19] (the one that happened after both killing /etc/lcstore, and doing the reversion that re-enabled the conditional on /etc/lcstore) [21:54:16] So there is a TLDR version of this that the LCStoreStaticArray code isn't fit for production use? [21:55:50] https://phabricator.wikimedia.org/T99740 [21:56:34] i'm here now, just got in. can i help? [21:57:20] scap itself will work with either cdb or php files, depending on what the wiki is configured to use [21:57:38] ori: That's not the major problem [21:57:42] I don't think [21:58:02] is the question: how do we go back to cdbs safely and unbreak deployment? [21:58:06] It seems the implementation is fundementally broken (on the MW side using PHP string arrays) [21:58:07] No [21:58:08] yes [21:58:14] well, I thought it was! [21:58:29] I'm pretty sure I know how to do that, and I said above ;) [21:59:11] If we comment out $wgLocalisationCacheConf['storeClass'] = 'LCStoreStaticArray'; [21:59:17] Bump testwiki in -staging [21:59:20] Run scap [21:59:26] yep. [21:59:26] We have a cdb based l10n cache for wmf14 [21:59:36] yep [22:00:01] And delete the php l10n files too for good measure [22:00:13] no, don't do that [22:00:50] why not? [22:00:53] well, they're useless [22:01:01] other than for further testing [22:01:18] [22:15:46] <_joe|AFK> twentyafterfour: no the fact is that patch loads the cdb data in memory as static arrays [22:01:18] [22:15:54] <_joe|AFK> they won't ever be removed by design [22:02:24] yeah, i didn't think that through. [22:02:45] well, apparently neither did the fb guys bar "zomg, it's fast!" [22:03:14] and if you think that each appserver could need to load every language into memory [22:03:14] :D [22:04:09] Finally market for RAM again! 
[22:04:37] loading every language into memory is fine, loading every language into memory for every active branch is not [22:05:13] ACKNOWLEDGEMENT - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time daniel_zahn disk fail - T105835 [22:05:13] ACKNOWLEDGEMENT - HHVM processes on mw1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm daniel_zahn disk fail - T105835 [22:05:13] ACKNOWLEDGEMENT - HHVM rendering on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time daniel_zahn disk fail - T105835 [22:05:43] is that a new virus? [22:06:21] sudo daniel_zahn disk fail [22:06:46] :p [22:07:18] imagines a graph of failing disks per 100 servers [22:07:26] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1452705 (CCogdill_WMF) @Chmarkine, @BBlack, @faidon - can one of you give me a summary of the domains you're proposing we delete? I'... [22:14:11] operations, Traffic, HTTPS, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1452706 (Rillke) rillke-node (node.js) clients should be fixed except they decided to explicitly overwrite the protocol (but these aren't my bots). Thanks for pinging.
[22:22:05] (PS1) Dzahn: bump version to 2.10 [debs/wikistats] - https://gerrit.wikimedia.org/r/224726 [22:24:30] (CR) Dzahn: [C: 2] bump version to 2.10 [debs/wikistats] - https://gerrit.wikimedia.org/r/224726 (owner: Dzahn) [22:24:35] (Merged) jenkins-bot: bump version to 2.10 [debs/wikistats] - https://gerrit.wikimedia.org/r/224726 (owner: Dzahn) [22:34:34] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [22:45:11] (PS1) BBlack: Port Filipe da Silva's multicert patches, bump libssl to 1.0.2 [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/224728 (https://phabricator.wikimedia.org/T86654) [22:45:13] (PS1) BBlack: Release 1.9.3-1+wmf1 (multicert, libssl1.0.2) [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/224729 [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150714T2300). Please do the needful. [23:00:04] AaronSchulz: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:02:39] (CR) BBlack: [C: 2 V: 2] Port Filipe da Silva's multicert patches, bump libssl to 1.0.2 [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/224728 (https://phabricator.wikimedia.org/T86654) (owner: BBlack) [23:02:48] (CR) BBlack: [C: 2 V: 2] Release 1.9.3-1+wmf1 (multicert, libssl1.0.2) [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/224729 (owner: BBlack) [23:17:47] !log reprepro: nginx for jessie-wikimedia/main bumped to 1.9.3-1+wmf1 [23:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:13] !log updating nginx to 1.9.3-1+wmf1 on cp* [23:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:30] !log es1.6 upgrade: upgraded elastic1011 [23:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:24] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1429 bytes in 0.159 second response time