[00:00:03] (PS2) Ori.livneh: Remove wmgUseBits setting, now that the migration is complete [mediawiki-config] - https://gerrit.wikimedia.org/r/209130
[00:00:05] (PS1) Ori.livneh: MWWikiversions::readDbListFile: allow single-line ('#' or '//') comments [mediawiki-config] - https://gerrit.wikimedia.org/r/209150
[00:00:25] Apparently snapshot1004 is actually starved for RAM -- https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=snapshot1004.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1430870331&g=mem_report&z=large&c=Miscellaneous%20eqiad
[00:01:03] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/209150/
[00:01:07] (tested)
[00:01:21] bd808: is rsnapshot a apergo.s managed host/project?
[00:01:23] this doesn't make dblists turing-complete, but it brings us closer
[00:01:24] an*
[00:01:38] haha nice
[00:01:52] greg-g: *nod* and I think where h.oo runs the wikidata dumps as well
[00:01:56] * greg-g nods
[00:04:18] !log catrope Synchronized php-1.26wmf4/extensions/WikiEditor: SWAT (duration: 00m 42s)
[00:04:26] Logged the message, Master
[00:06:37] !log catrope Synchronized php-1.26wmf4/extensions/Flow: SWAT (duration: 00m 52s)
[00:06:37] !log catrope Synchronized php-1.26wmf4/extensions/CirrusSearch: SWAT (duration: 00m 28s)
[00:06:38] Logged the message, Master
[00:06:38] Logged the message, Master
[00:13:09] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0]
[00:13:09] !log Aborted sync-common on snapshot1004; host is starved for RAM and using swap heavily
[00:13:14] Logged the message, Master
[00:13:24] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:14:54] RECOVERY - puppet last run on snapshot1004 is OK Puppet is currently enabled, last run 25 minutes ago with 0 failures
[00:15:13] !log catrope Synchronized php-1.26wmf3/extensions/WikiEditor: SWAT (duration: 00m 33s)
[00:15:19] Logged the message, Master
[00:15:37] !log catrope Synchronized php-1.26wmf3/extensions/Flow: SWAT (duration: 00m 23s)
[00:15:42] Logged the message, Master
[00:20:43] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[00:22:53] (PS1) Jforrester: visualeditor-default.dblist: Add comments explaining order [mediawiki-config] - https://gerrit.wikimedia.org/r/209151
[00:26:03] operations: Changes to the fr-tech@ email group - https://phabricator.wikimedia.org/T98269#1263382 (K4-713) NEW
[00:26:23] SWAT's done BTW
[00:26:31] operations: Changes to the fr-tech@ email group - https://phabricator.wikimedia.org/T98269#1263389 (K4-713)
[00:29:03] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds
[00:30:34] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[00:31:19] RoanKattouw: You missed https://gerrit.wikimedia.org/r/#/c/207332/ ?
[00:31:46] I guess so?
[00:31:50] Did someone add it after 4pm?
[00:31:57] Oh no I'm just blind
[00:31:59] Sorry bd808
[00:32:04] heh. no worries
[00:32:15] can you get it or do you want me to?
[00:32:26] Can you take it?
[00:32:30] I've gone back to coding
[00:32:32] Sorry for missing it
[00:32:47] I'll handle it.
Thanks
[00:35:22] (PS9) BryanDavis: Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: Alex Monk)
[00:35:33] (CR) BryanDavis: [C: 2] Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: Alex Monk)
[00:37:41] (PS1) Dzahn: admin: add user for dkg [puppet] - https://gerrit.wikimedia.org/r/209155 (https://phabricator.wikimedia.org/T98148)
[00:39:23] (Merged) jenkins-bot: Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: Alex Monk)
[00:40:21] Jamesofur|cloud: ping?
[00:40:38] (CR) BryanDavis: gdash: adjust deploy metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/208085 (https://phabricator.wikimedia.org/T64667) (owner: Filippo Giunchedi)
[00:41:13] any community people/liaisons around?
:)
[00:41:40] JohnFLewis: ^
[00:42:02] (enwiki, specifically)
[00:44:11] paravoid: pong
[00:44:45] hi
[00:44:55] !log bd808 Synchronized wmf-config/AffComContactPages.php: Add AffCom user group application contact page on meta {{gerrit|207332}} (duration: 00m 33s)
[00:44:58] not sure if you're the best person to ask this
[00:45:01] Logged the message, Master
[00:45:05] so bear with me :)
[00:45:18] !log bd808 Synchronized docroot/noc/conf/AffComContactPages.php.txt: Add AffCom user group application contact page on meta {{gerrit|207332}} (duration: 00m 15s)
[00:45:19] enwiki's featured picture for today is a 7MB image
[00:45:24] Logged the message, Master
[00:45:29] it's not actually causing any issues on our servers (yet)
[00:45:40] Main_Page isn't all that popular apparently :)
[00:45:41] !log bd808 Synchronized docroot/noc/createTxtFileSymlinks.sh: Add AffCom user group application contact page on meta {{gerrit|207332}} (duration: 00m 17s)
[00:45:47] Logged the message, Master
[00:45:56] but this isn't great for end-users
[00:46:16] !log bd808 Synchronized wmf-config/CommonSettings.php: Add AffCom user group application contact page on meta {{gerrit|207332}} (duration: 00m 20s)
[00:46:22] Logged the message, Master
[00:46:32] jamesofur: ^ (that was for you :)
[00:47:43] about 20M views a day on average
[00:48:16] hmmm
[00:48:41] so yeah, a 7MB image on our frontpage is not a very good idea from an end-user performance standpoint
[00:48:45] ori: ^ too, btw
[00:49:34] * jamesofur pokes some people
[00:49:45] thanks
[00:51:25] !log schema change running T95179 wikidata, bit unusual, dropping a not-null field
[00:51:30] Logged the message, Master
[00:53:38] paravoid: Do you think that's why catchpoint is crying?
[00:53:44] yes
[00:53:53] they have a 3MB limit
[00:53:53] Makes sense...
[00:53:55] ah
[00:54:14] chasemp is working with them
[00:54:23] or around the check by now, I guess :)
[00:55:43] paravoid: "dkg" on rhenium, access in this case assumes root?
[00:56:28] so I figured out you can ignore a page asset explicitly
[00:56:41] so even though I can't up the limit I can tell it not to fetch the object
[00:56:54] so anyways, keeps continuity on teh test for performance and should be ok
[00:57:00] hoo: Are either of the php using 3G of swap on snapshot1004 yours?
[00:57:07] assuming we don't stack the page with 7MB gifs
[00:57:23] bd808: Nope... if they were I'd already have killed them
[00:57:24] paravoid: ^ thanks
[00:57:37] Not sure why that stuff leaks memory, but it seems to do
[00:57:46] I'm not going to touch it, that's Ariel's thing
[00:58:13] hoo: *nod*
[00:58:38] looks like maybe one is stuck and now there are 2 overlapping jobs or something
[00:59:13] (PS5) Yuvipanda: logstash: Convert Elasticsearch on logstash100[1-3] to client [puppet] - https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: BryanDavis)
[01:01:24] (CR) Yuvipanda: [C: 2] logstash: Convert Elasticsearch on logstash100[1-3] to client [puppet] - https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: BryanDavis)
[01:03:59] jamesofur: I'm off, others are around so ping this channel if you need anything from us regarding that
[01:04:13] thanks again :)
[01:04:14] will do, thanks paravoid
[01:04:19] no promises :) but I'm conversing :)
[01:05:11] springle: doing the schema change now?
[01:06:39] aude: correct
[01:06:44] springle: thanks :)
[01:09:14] (PS1) BryanDavis: logstash: fix heap size setting [puppet] - https://gerrit.wikimedia.org/r/209158
[01:09:19] yuvipanda: ^
[01:09:37] (CR) Yuvipanda: [C: 2 V: 2] logstash: fix heap size setting [puppet] - https://gerrit.wikimedia.org/r/209158 (owner: BryanDavis)
[01:09:43] haha
[01:09:47] hindsight :)
[01:09:52] yeah
[01:10:04] and boo hardcoding the 'm'
[01:10:24] (PS1) Dzahn: admin: add group for traceback-roots [puppet] - https://gerrit.wikimedia.org/r/209159 (https://phabricator.wikimedia.org/T98148)
[01:11:48] (PS2) Dzahn: admin: add group for traceback-roots [puppet] - https://gerrit.wikimedia.org/r/209159 (https://phabricator.wikimedia.org/T98148)
[01:12:17] yuvipanda: :( "Incompatible minimum and maximum heap sizes specified"
[01:12:22] stupid java
[01:12:35] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[01:12:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[01:14:48] paravoid: it's been removed
[01:15:48] bd808: should we revert?
[01:16:09] yuvipanda: live editing to find good value
[01:16:15] bd808: alright :)
[01:16:47] 256m works, 512m and higher doesn't
[01:17:03] jamesofur: thanks!
[01:17:39] bd808: why does it have a *minimum* heapsize requiremnet....
[01:17:54] (PS1) BryanDavis: logstash: change heap size to 256m [puppet] - https://gerrit.wikimedia.org/r/209161
[01:18:05] yuvipanda: so the heap size is static
[01:18:09] (CR) Yuvipanda: [C: 2 V: 2] logstash: change heap size to 256m [puppet] - https://gerrit.wikimedia.org/r/209161 (owner: BryanDavis)
[01:18:21] jvm allocates all of it on startup and doesn't try to scale
[01:18:22] bd808: merged
[01:18:45] applying...
[01:19:52] 1001 looks good again
[01:21:18] 1003 looks good yuvipanda. You should be safe to get on with your day
[01:22:28] jamesofur: thanks.
you know, should they use .gifv for that? https://imgur.com/blog/2014/10/09/introducing-gifv/
[01:22:58] mutante: do we support it? Could be an interesting thing to bring up
[01:23:21] jamesofur: no idea, just came to my mind because i keep seeing them on imgur and reddit
[01:23:28] * jamesofur nods
[01:23:30] yes, let's find out
[01:24:14] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[01:24:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:24:53] Imgur plans to submit an accompanying specification to relevant standards organizations before the end of the year.
[01:25:13] is it just renamed .webm?
[01:25:19] it seems like it
[01:31:14] gfycat gets it down to 2mb https://gfycat.com/GenuineRareFanworms >.>
[01:31:42] quiddity: :) does UploadWizard like it?
[01:32:07] if I save to desktop, it's just a .webm
[01:33:28] mutante, So I would assume it would work. However, I fear the nigh-insanity of copyright nuances, and don't have time/energy to read and refresh my fuzzy memory, so if anyone else wants to upload, go for it.
[01:33:34] quiddity: 2 MiB is still a lot more than 27 KiB which the thumbnail is.
[01:33:43] true dat
[01:33:44] quiddity: (The current thumbnail.)
[01:34:35] (PS1) Springle: move db1021 to s2 and db1054 to sideline [puppet] - https://gerrit.wikimedia.org/r/209164 (https://phabricator.wikimedia.org/T89801)
[01:38:10] (CR) Springle: [C: 2] move db1021 to s2 and db1054 to sideline [puppet] - https://gerrit.wikimedia.org/r/209164 (https://phabricator.wikimedia.org/T89801) (owner: Springle)
[01:40:47] !log upgrade db1021 trusty
[01:40:56] Logged the message, Master
[01:42:59] * aude assumes greg-g is not around or is he?
[01:43:23] hi
[01:43:29] * aude prefers not to wait until swat to deploy https://gerrit.wikimedia.org/r/#/c/209160/
[01:43:31] what's up?
[01:43:39] is the image issue sorted out?
[01:43:49] ori: It has been replaced, apparently
[01:44:06] ori: See my response. Yes.
[01:44:22] * ori already dealt with earlier
[01:44:35] ori: Ouch.
[01:45:04] aude: go for it, i'm around to babysit
[01:45:12] ori: ok :)
[01:45:18] ori: the image has been removed from main_page
[01:46:07] cool
[01:47:12] (CR) Ori.livneh: [C: 2] MWWikiversions::readDbListFile: allow single-line ('#' or '//') comments [mediawiki-config] - https://gerrit.wikimedia.org/r/209150 (owner: Ori.livneh)
[01:47:19] (Merged) jenkins-bot: MWWikiversions::readDbListFile: allow single-line ('#' or '//') comments [mediawiki-config] - https://gerrit.wikimedia.org/r/209150 (owner: Ori.livneh)
[01:48:19] (CR) Ori.livneh: "Empty lines are permitted now, too, so you can add those for clarity if you like. Or I can merge, if you prefer it as-is." [mediawiki-config] - https://gerrit.wikimedia.org/r/209151 (owner: Jforrester)
[01:48:43] (CR) Jforrester: "This works. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/209151 (owner: Jforrester)
[01:49:01] (CR) Ori.livneh: [C: 2] visualeditor-default.dblist: Add comments explaining order [mediawiki-config] - https://gerrit.wikimedia.org/r/209151 (owner: Jforrester)
[01:49:14] Thanks.
[01:52:29] The amount of log spam produced by the ocg service is pretty astounding
[01:52:35] !log ori Synchronized multiversion/MWWikiversions.php: Ib08e36901: MWWikiversions::readDbListFile: allow single-line ("#" or "//") comments (duration: 00m 18s)
[01:52:43] Logged the message, Master
[01:52:47] bd808: how astounding? :)
[01:53:13] 56 log events for a 100% successful pdf creation awounding
[01:53:20] *astounding
[01:53:48] could be a lot more too because it sends a log message for each media element added to the doc
[01:53:53] That does seem noisy
[01:54:34] here's one example run -- https://logstash.wikimedia.org/#dashboard/temp/kurRg4PSRJC3QaFBOAYQyg
[01:55:54] TL;DR
[01:55:57] seriously.
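The dblist change synced above (r/209150) lets list files carry single-line '#' or '//' comments, and the review comment on r/209151 notes that empty lines are permitted too. A minimal sketch of that parsing rule follows. It is not the actual PHP code in MWWikiversions::readDbListFile: the function name `read_dblist`, the sample contents, and the choice to treat a marker anywhere on a line as starting a comment are all assumptions made for illustration.

```python
def read_dblist(text):
    """Return database names from dblist-style text.

    Hypothetical helper, not the real MWWikiversions code.
    Assumption: '#' or '//' starts a comment that runs to end of line,
    and lines left empty after stripping are skipped.
    """
    dbs = []
    for raw in text.splitlines():
        line = raw
        # Cut the line at the first occurrence of each comment marker.
        for marker in ("#", "//"):
            idx = line.find(marker)
            if idx != -1:
                line = line[:idx]
        line = line.strip()
        if line:  # skip blank and comment-only lines
            dbs.append(line)
    return dbs


# Made-up sample contents; real dblists hold one wiki db name per line.
sample = """\
# Phase 1
enwiki
// Phase 2

dewiki  # trailing note
"""

print(read_dblist(sample))
```

With the sample above, only `enwiki` and `dewiki` survive; the two comment lines and the blank line are dropped.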
[01:56:06] (Merged) jenkins-bot: visualeditor-default.dblist: Add comments explaining order [mediawiki-config] - https://gerrit.wikimedia.org/r/209151 (owner: Jforrester)
[01:56:10] i got bored around chapter 5
[01:56:42] This message is so important "Render completed successfully!"
[01:56:57] i was going to say, the exclamation mark tells you it's a hack
[01:57:03] because the developer was surprised that it worked
[01:57:13] lol
[01:57:14] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1263700 (Dzahn) Open>Resolved I confirmed files exists in backup (bconsole on helium) and then deleted them from zirconium. the entire document root as well as the Ap...
[01:57:39] operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1263702 (Stu) NEW
[01:57:57] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1263709 (Dzahn) Resolved>Open
[01:59:28] !log aude Synchronized php-1.26wmf4/extensions/Wikidata: Fix Wikibase api error output bug (duration: 01m 08s)
[01:59:35] Logged the message, Master
[02:00:05] * aude done
[02:00:28] gah
[02:01:04] ?
[02:01:13] helps to update the submodule
[02:01:35] !log aude Synchronized php-1.26wmf4/extensions/Wikidata: Fix Wikibase api error output bug - update submoduled (duration: 00m 28s)
[02:01:58] ok, looks good
[02:02:03] (PS3) Ori.livneh: Remove wmgUseBits setting, now that the migration is complete [mediawiki-config] - https://gerrit.wikimedia.org/r/209130
[02:02:22] (CR) Ori.livneh: [C: 2] Remove wmgUseBits setting, now that the migration is complete [mediawiki-config] - https://gerrit.wikimedia.org/r/209130 (owner: Ori.livneh)
[02:02:27] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1263710 (Dzahn) @springle on db1001/dbprox1001: the databases "contacts_civicrm" and "contacts_drupal" can now be dropped entirely. we have dump files in backup and everyth...
[02:02:27] Logged the message, Master
[02:02:28] (Merged) jenkins-bot: Remove wmgUseBits setting, now that the migration is complete [mediawiki-config] - https://gerrit.wikimedia.org/r/209130 (owner: Ori.livneh)
[02:02:50] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1263712 (Dzahn) a:Dzahn>Springle
[02:03:17] Ops-Access-Requests, operations, Analytics: Access to stat1003 for jdouglas - https://phabricator.wikimedia.org/T98209#1263714 (Dzahn) p:Triage>Normal
[02:03:52] aude: is that change you pushed going to stop the "Undefined index: html" notices?
[02:04:06] !log ori Synchronized wmf-config/CommonSettings.php: I83ad6d060: Remove wmgUseBits setting, now that the migration is complete (duration: 00m 18s)
[02:04:13] Logged the message, Master
[02:04:16] bd808: it is
[02:04:21] <3
[02:04:27] assuming that's the only place it was a problem
[02:05:32] Ops-Access-Requests, operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1263717 (Dzahn)
[02:05:41] Ops-Access-Requests, operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1263719 (Dzahn) p:Triage>Normal
[02:08:20] PROBLEM - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0]
[02:14:57] operations, Incident-20150205-SiteOutage, MediaWiki-Debug-Logging, Reading-Infrastructure-Team, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1263722 (bd808)
[02:19:51] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[02:26:58] !log xtrabackup clone db1060 to db1021
[02:27:08] Logged the message, Master
[02:29:38] operations: Changes to the fr-tech@ email group - https://phabricator.wikimedia.org/T98269#1263750 (Dzahn) @K4-713 done! fr-software-engineers: agreen, awight, eeggleston, khorn fr-tech: fr-tech-ops, fr-software-engineers, agomez, erik, mhernandez, pcoombe, ppena, dkozlowski (ssmith was on software-eng...
[02:29:52] operations: Changes to the fr-tech@ email group - https://phabricator.wikimedia.org/T98269#1263751 (Dzahn) Open>Resolved a:Dzahn
[02:34:01] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:40] RECOVERY - RAID on snapshot1004 is OK no RAID installed
[02:36:03] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 10m 46s)
[02:36:11] Logged the message, Master
[02:40:41] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:44:00] RECOVERY - RAID on snapshot1004 is OK no RAID installed
[02:44:01] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[02:44:21] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[02:46:29] !log LocalisationUpdate completed (1.26wmf3) at 2015-05-06 02:45:26+00:00
[02:46:37] Logged the message, Master
[02:57:01] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[02:57:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:07:03] PROBLEM - RAID on db1027 is CRITICAL 1 failed LD(s) (Degraded)
[03:07:03] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[03:09:43] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 08m 46s)
[03:09:58] Logged the message, Master
[03:10:12] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 18334 MB (51% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366625 MB (25% inode=99%)
[03:12:03] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0]
[03:12:52] operations, ops-eqiad: db1027 raid degraded - https://phabricator.wikimedia.org/T98285#1263765 (Springle) NEW
[03:13:24] ACKNOWLEDGEMENT - RAID on db1027 is CRITICAL 1 failed LD(s) (Degraded) Sean Pringle T98285
[03:14:31] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-06 03:13:28+00:00
[03:14:38] Logged
the message, Master
[03:16:47] (PS4) Ori.livneh: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007
[03:16:51] (CR) jenkins-bot: [V: -1] Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[03:17:05] ori: argh really
[03:17:13] operations, Incident-20150205-SiteOutage, MediaWiki-Debug-Logging, Reading-Infrastructure-Team, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1263773 (bd808)
[03:17:15] operations, Wikimedia-Logstash, Patch-For-Review, User-Bd808-Test: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1263772 (bd808) Open>Resolved
[03:17:17] operations, Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1263774 (bd808)
[03:22:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[03:23:23] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:27:19] (CR) Springle: [C: -2] "Need to do a review of what else expects the old path. Probably not a big deal, but eg, currently running schema changes peeking at mediaw" [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[03:32:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:32:36] (PS1) BryanDavis: Send group0 + group1 MediaWiki events to logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/209170 (https://phabricator.wikimedia.org/T88732)
[03:33:13] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[03:38:41] (CR) MZMcBride: "I wonder if it makes sense to do this more slowly/piecemeal. It might be safer."
[mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[03:45:04] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:45:57] (CR) BryanDavis: "Posted for SWAT 2015-05-06T15:00Z" [mediawiki-config] - https://gerrit.wikimedia.org/r/209170 (https://phabricator.wikimedia.org/T88732) (owner: BryanDavis)
[03:47:16] (PS5) Ori.livneh: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007
[03:47:21] (CR) jenkins-bot: [V: -1] Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[03:48:29] springle: woops, didn't see your review
[03:48:42] PROBLEM - puppet last run on cp4008 is CRITICAL puppet fail
[03:48:53] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[03:53:43] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0]
[03:53:45] (PS6) Ori.livneh: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007
[03:56:22] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[03:56:23] springle: do you think it is frivolous? I find it makes the repository a little less confusing to navigate and a lot less overwhelming
[03:57:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:58:16] ori: it's not frivolous. Just need to approach carefully
[03:58:49] I personally think it would be nice to move the lists. I was in operations-mediawiki-config/ the other day and they were cluttery.
[03:59:33] It wasn't clear to me if the URLs (noc.wikimedia.org) would change/break. I didn't look closely enough.
[04:00:23] Fiona: the patch takes care of that
[04:00:28] Nice.
[04:01:12] I was thinking people could have scripts relying on noc.wikimedia.org/conf/all.dblist or whatever.
[04:06:43] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[04:06:44] could be
[04:07:33] (PS1) BryanDavis: Send MediaWiki events for all wikis to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/209172 (https://phabricator.wikimedia.org/T88732)
[04:08:35] (CR) BryanDavis: [C: -1] "Need to sit on this until we run group0 + group1 for at least a day to see how the new Elasticsearch cluster holds up." [mediawiki-config] - https://gerrit.wikimedia.org/r/209172 (https://phabricator.wikimedia.org/T88732) (owner: BryanDavis)
[04:08:44] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[04:09:33] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[04:15:03] PROBLEM - puppet last run on maerlant is CRITICAL puppet fail
[04:21:13] we had similar questions re: not having ensured there's no scripts that would be affected by forcing HTTPS on noc.wm.o, too
[04:21:34] hence stalled out change for ithere: https://gerrit.wikimedia.org/r/#/c/199515/
[04:33:03] RECOVERY - puppet last run on maerlant is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures
[04:49:23] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds
[04:50:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds
[04:58:52] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[04:59:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[05:00:13] (CR) BryanDavis: "How is mira.codfw.wmnet going to be synced with tin?
The scap setting that is being changed here is the fallback host in the event that no" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/208801 (owner: John F. Lewis)
[05:01:02] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds
[05:01:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 6 below the confidence bounds
[05:09:17] (PS1) BBlack: support geoiplookup target on all enabled clusters [puppet] - https://gerrit.wikimedia.org/r/209173
[05:14:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[05:15:13] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[05:16:15] (PS1) BBlack: move geoiplookup to text-addrs-v4 [dns] - https://gerrit.wikimedia.org/r/209174
[05:25:33] RECOVERY - HTTP error ratio anomaly detection on graphite2001 is OK No anomaly detected
[05:26:13] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[05:36:52] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0]
[05:43:32] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[05:47:23] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 75907 MB (3% inode=99%)
[05:50:03] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0]
[05:56:33] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[05:59:13] (PS1) Ori.livneh: EventLogging varnish log tailers: make '.gif' suffix optional [puppet] - https://gerrit.wikimedia.org/r/209175
[05:59:34] ^ bblack
[05:59:44] I can migrate EL after that
[06:00:07] (and will follow up with a patch to make the regex simpler)
[06:00:55] once I sync the
EventLogging configuration change, the regex can simply be RxURL:^/beacon/event\?.
[06:02:42] (CR) BBlack: [C: 1] EventLogging varnish log tailers: make '.gif' suffix optional [puppet] - https://gerrit.wikimedia.org/r/209175 (owner: Ori.livneh)
[06:02:54] yay, thanks
[06:03:07] (CR) Ori.livneh: [C: 2] EventLogging varnish log tailers: make '.gif' suffix optional [puppet] - https://gerrit.wikimedia.org/r/209175 (owner: Ori.livneh)
[06:04:52] (PS5) BBlack: refactor varnish::logging for icinga check [puppet] - https://gerrit.wikimedia.org/r/208659
[06:07:20] (PS6) BBlack: refactor varnish::logging for icinga check [puppet] - https://gerrit.wikimedia.org/r/208659
[06:14:10] (PS2) BBlack: transparency: make it HTTPS only and enable HSTS [puppet] - https://gerrit.wikimedia.org/r/199517 (https://phabricator.wikimedia.org/T40516) (owner: Chmarkine)
[06:15:09] (CR) BBlack: [C: 2] transparency: make it HTTPS only and enable HSTS [puppet] - https://gerrit.wikimedia.org/r/199517 (https://phabricator.wikimedia.org/T40516) (owner: Chmarkine)
[06:17:17] (CR) Nikerabbit: "Are the browsers clever enough not to re-fetch the fonts again when browsing different wiki? There is the font-name parameter, but I doubt" [mediawiki-config] - https://gerrit.wikimedia.org/r/209027 (owner: Ori.livneh)
[06:26:37] (CR) Nemo bis: "Also, is there any other "central" domain available to serve fonts?
Perhaps upload.wikimedia.org or even login.wikimedia.org, as both are " [mediawiki-config] - https://gerrit.wikimedia.org/r/209027 (owner: Ori.livneh)
[06:28:02] PROBLEM - puppet last run on mw2014 is CRITICAL Puppet has 1 failures
[06:29:53] PROBLEM - puppet last run on mw2059 is CRITICAL puppet fail
[06:30:22] PROBLEM - puppet last run on mw1249 is CRITICAL puppet fail
[06:30:24] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on db2042 is CRITICAL puppet fail
[06:32:03] PROBLEM - puppet last run on ms-be3001 is CRITICAL puppet fail
[06:32:13] PROBLEM - puppet last run on wtp2012 is CRITICAL Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on lvs2001 is CRITICAL Puppet has 1 failures
[06:32:42] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet has 1 failures
[06:33:12] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures
[06:33:23] PROBLEM - puppet last run on mw1114 is CRITICAL Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on mw1251 is CRITICAL Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 1 failures
[06:33:33] PROBLEM - puppet last run on mw1175 is CRITICAL Puppet has 1 failures
[06:33:54] PROBLEM - puppet last run on mw1002 is CRITICAL Puppet has 1 failures
[06:33:54] PROBLEM - puppet last run on mw2092 is CRITICAL Puppet has 2 failures
[06:33:54] PROBLEM - puppet last run on mw2011 is CRITICAL Puppet has 1 failures
[06:34:13] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:34:32] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 2 failures
[06:34:32] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures
[06:34:32]
PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures
[06:34:44] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures
[06:38:22] (CR) Filippo Giunchedi: "@Krinkle, in production that's already like that, you mean external mw users?" [puppet] - https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: Filippo Giunchedi)
[06:38:32] (PS2) Filippo Giunchedi: nova: install ::mediawiki::cgroup [puppet] - https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712)
[06:38:41] (CR) Filippo Giunchedi: [C: 2 V: 2] nova: install ::mediawiki::cgroup [puppet] - https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: Filippo Giunchedi)
[06:40:01] (CR) BBlack: "login probably isn't a good target for this (or anything really). upload is possible if it's a real concern. We'd have to implement some" [mediawiki-config] - https://gerrit.wikimedia.org/r/209027 (owner: Ori.livneh)
[06:40:39] (CR) Ori.livneh: "> You're only going to take a new, unnecessary font-loading hit when switching domains to a project you haven't visited before to get them" [mediawiki-config] - https://gerrit.wikimedia.org/r/209027 (owner: Ori.livneh)
[06:43:09] godog: morning -- fluorine disk is crit :/
[06:44:06] ori: lolz, didn't expect it to be that quick, new disks are in I'll be growing the fs shortly
[06:44:33] RECOVERY - puppet last run on mw2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:03] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:46:43] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:46:52] RECOVERY - puppet last run on mw1175 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:47:02] RECOVERY - puppet last run on cp4003 is OK Puppet is currently
enabled, last run 30 seconds ago with 0 failures [06:47:12] RECOVERY - puppet last run on wtp2012 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:47:13] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:47:13] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:47:32] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:33] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:47:33] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:43] RECOVERY - puppet last run on db2042 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:47:43] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:47:43] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:43] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:48:03] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:03] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:03] RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:48:13] RECOVERY - puppet last run on mw1114 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:48:22] RECOVERY - puppet last run on mw1251 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:23] RECOVERY - puppet last run on mw1249 is OK Puppet is currently enabled, last run 21 seconds ago with 0 
failures [06:48:43] RECOVERY - puppet last run on mw1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:43] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:43] RECOVERY - puppet last run on mw2011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:25] 6operations, 5Patch-For-Review: Investigate why cgroup on silver was only mounted with 4k of space - https://phabricator.wikimedia.org/T92712#1263923 (10fgiunchedi) 5Open>3Resolved we should be set now, please reopen if it comes up again ``` silver:~$ grep mediawiki /proc/mounts cgroup /sys/fs/cgroup/mem... [06:50:13] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:31] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed May 6 06:50:27 UTC 2015 (duration 50m 26s) [06:51:40] Logged the message, Master [06:54:23] RECOVERY - Disk space on fluorine is OK: DISK OK [06:55:51] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1263929 (10fgiunchedi) ok gave another 500G to /a on fluorine ``` fluorine:~$ cat /proc/mdstat Personalities : [linear] [multipath] [r... [06:59:31] "Investigation if Fluorine needs bigger disks or we retain too much data" [06:59:46] https://en.wikipedia.org/wiki/False_dilemma ! 
[06:59:55] fluorine needs bigger disks because we retain too much data [07:02:15] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 10procurement: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1263932 (10fgiunchedi) [07:02:17] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1263930 (10fgiunchedi) 5Open>3Resolved new disks on fluorine, resolving, let's followup on related T88393 [07:07:13] PROBLEM - puppet last run on db2067 is CRITICAL puppet fail [07:17:55] (03PS1) 10Ori.livneh: Change EventLogging endpoint to /beacon/event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209178 [07:19:39] (03CR) 10Ori.livneh: [C: 032] Change EventLogging endpoint to /beacon/event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209178 (owner: 10Ori.livneh) [07:19:45] (03Merged) 10jenkins-bot: Change EventLogging endpoint to /beacon/event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209178 (owner: 10Ori.livneh) [07:20:30] !log ori Synchronized wmf-config/CommonSettings.php: I019944f42: Change EventLogging endpoint to /beacon/event (duration: 00m 14s) [07:20:39] Logged the message, Master [07:23:43] RECOVERY - puppet last run on db2067 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:25:58] (03PS1) 10Ori.livneh: EventLogging log tailers: simplify regexes [puppet] - 10https://gerrit.wikimedia.org/r/209179 [07:26:41] bblack: how are you monitoring which bits URLs are still getting hit? varnishtop on a random bits varnish? [07:28:35] varnishlog on random bits varnishes [07:30:43] ori: should I switch the android app to /beacon/event? [07:30:47] hmm, I should probably go home first [07:30:59] yuvipanda: yes to both [07:31:16] heh [07:32:03] yeah the event.gif traffic on bits is exclusively android now [07:32:18] really? 
[07:32:22] well, and iOS I guess [07:32:36] i haven't seen any ios events scroll past [07:32:44] but i guess it's got a much smaller base of users [07:32:53] ori: teeheee yes :D [07:34:02] ori: I’ll do that tomorrow! it’ll be nice to have a bug with links / rationale for all the moving about :) [07:34:07] * yuvipanda packs it in for realz [07:34:51] yes, I have done a remarkably poor job at documenting the steps involved in the migration [07:34:59] bblack did much better, there is actually a task [07:35:08] heh [07:35:22] but hey, it's mostly done and no casualties! [07:35:24] well, when we get down to the long tail of things harder to fix, we can doc up some URL transitions to publish [07:36:42] ori: do you know where the code that actually hits bits.wikimedia.org/geoiplookup lives? I presume in some js somewhere... [07:36:42] centralnotice, i think [07:36:43] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [07:36:51] also ULS [07:36:52] ah that makes sense [07:37:02] should beta also not be using bits? [07:37:09] https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/master/modules/ext.centralNotice.bannerController/bannerController.js [07:37:17] GlobalCssJs is still set to 'loadScript' => '//bits.beta.wmflabs.org/meta.wikimedia.beta.wmflabs.org/load.php', [07:37:39] and https://github.com/wikimedia/mediawiki-extensions-UniversalLanguageSelector/blob/master/UniversalLanguageSelector.hooks.php#L80 [07:37:50] legoktm: probably. 
some of these changes have already moved traffic off of beta bits, but it will need some auditing eventually as well [07:38:07] legoktm: yes, it should be meta; i updated globalcssjs for prod but forgot beta [07:38:25] there's no real pressure to kill the betabits instance other than synchronicity with prod, though [07:39:06] right, but it's good to do anyhow, since the value of beta is its verisimilitude [07:39:14] right [07:39:22] there are many things wrong on that front that need fixing heh [07:40:12] (03CR) 1020after4: [C: 031] Update statsd events [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) (owner: 10BryanDavis) [07:40:35] (03PS2) 10BBlack: support geoiplookup target on all enabled clusters [puppet] - 10https://gerrit.wikimedia.org/r/209173 [07:40:37] legoktm: would you like me to update it? [07:41:12] (03CR) 10BBlack: [C: 032 V: 032] support geoiplookup target on all enabled clusters [puppet] - 10https://gerrit.wikimedia.org/r/209173 (owner: 10BBlack) [07:41:38] ^ flying blind without catalog-compiler, so we'll see if that ends up causing a rash of puppetfails [07:42:08] ( https://phabricator.wikimedia.org/T96802#1257838 ) [07:42:58] ori: that would be nice but there's no rush [07:43:19] might as well [07:44:10] the centralnotice one doesn't need any change, I'll move the geoiplookup hostname for it [07:44:18] ULS does though [07:44:53] (03CR) 10Filippo Giunchedi: gdash: adjust deploy metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/208085 (https://phabricator.wikimedia.org/T64667) (owner: 10Filippo Giunchedi) [07:44:55] (03PS1) 10Ori.livneh: Update GlobalCssJs on labs to not use bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209180 [07:45:03] (03PS2) 10Filippo Giunchedi: gdash: adjust deploy metrics [puppet] - 10https://gerrit.wikimedia.org/r/208085 (https://phabricator.wikimedia.org/T64667) [07:46:33] (03CR) 10Legoktm: [C: 031] Update GlobalCssJs on labs to not use bits 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/209180 (owner: 10Ori.livneh) [07:46:33] (03PS1) 10Muehlenhoff: (Bug: T97411) Refresh the control file and change the version scheme; we forked off the last 3.19 Debian upload (3.19.3) and all further updates will be folded in via the stable patchsets. [debs/linux] - 10https://gerrit.wikimedia.org/r/209181 [07:47:48] bblack: https://gerrit.wikimedia.org/r/209182 [07:47:53] Nikerabbit: ^ [07:48:28] ori: will be ~30m before all the varnishes get the change for it [07:48:33] (03CR) 10Ori.livneh: [C: 032] Update GlobalCssJs on labs to not use bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209180 (owner: 10Ori.livneh) [07:48:38] (03Merged) 10jenkins-bot: Update GlobalCssJs on labs to not use bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209180 (owner: 10Ori.livneh) [07:49:56] bblack: I know :) it won't get deployed right away and will likely be a few days before it gets reviewed [07:49:57] ok [07:50:06] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1263975 (10fgiunchedi) also note that I've added only one raid1 as a PV at the moment, we can add the other if needed ``` root@fluorin... [07:52:15] (03CR) 10Ori.livneh: [C: 04-2] "the iOS and Android apps need to be updated first" [puppet] - 10https://gerrit.wikimedia.org/r/209179 (owner: 10Ori.livneh) [07:54:37] ori: what's the plan for a hostname for e.g. /beacon/media?duration=4073&uri=http%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F8%2F81%2FFancoil_1.jpg ? [07:54:59] oh beacon is on upload too I guess since it went to the common VCL file [07:55:21] does it need a hostname? 
[07:55:27] no [07:55:45] I was just thinking for some reason that the new beacon support would have been on text but not upload [07:55:48] but it's everywhere [07:56:14] (03CR) 10Yuvipanda: "Even then, update rate isn't 100% - there will always be stragglers. I wonder if we should keep like a redirect for another year or so." [puppet] - 10https://gerrit.wikimedia.org/r/209179 (owner: 10Ori.livneh) [07:56:16] well, the appearance of upload.wikimedia.org in that URL is misleading [07:56:27] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1264002 (10Jimkont) other examples of old serializations can be found here: https://github.com/dbpedia/extraction-fra... [07:56:31] it was logged elsewhere [07:56:51] in fact nothing should be hitting upload.wikimedia.org/beacon/* [07:57:08] ok [07:57:43] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [07:57:48] alright, i'm crashing [07:57:51] * ori -> off [07:57:52] nite! [07:57:56] nite, thanks! [08:00:37] (03CR) 10Nemo bis: "Well, that used to be considered a big deal (around one year ago). Maybe add a phabricator task to remember monitoring the bandwidth incre" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209027 (owner: 10Ori.livneh) [08:03:27] good night [08:04:44] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[08:05:05] <_joe_> night ori [08:11:18] hmm the unmerged thing is (Merged) jenkins-bot: Update GlobalCssJs on labs to not use bits [mediawiki-config] - https://gerrit.wikimedia.org/r/209180 (owner: Ori.livneh) [08:11:29] oh [08:11:32] I'll do that for him [08:12:24] !log legoktm Synchronized wmf-config/CommonSettings-labs.php: no-op (duration: 00m 24s) [08:12:36] Logged the message, Master [08:12:43] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [08:12:49] bblack: ^ [08:12:56] thanks! [08:21:46] akosiaris: morning, Do you know who/where should i discuss the Video scalers performance ? [08:24:47] (03PS3) 10Filippo Giunchedi: statsite: decommission class [puppet] - 10https://gerrit.wikimedia.org/r/208635 (https://phabricator.wikimedia.org/T95687) [08:24:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: decommission class [puppet] - 10https://gerrit.wikimedia.org/r/208635 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi) [08:34:12] yurik [08:40:40] yurik: I tried to use graph on he.wiki, but nothing is displayed, do you have a minute to help please ? [08:41:15] matanya, ;fod [08:41:31] sure [08:42:01] so i tried the map graph, i created https://he.wikipedia.org/wiki/Rawdata:WorldMap-json [08:42:12] and called it in https://he.wikipedia.org/wiki/%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9:Matanya/%D7%90%D7%A8%D7%92%D7%96_%D7%97%D7%95%D7%9C [08:42:24] but no map shown [08:42:43] yurik: i am probably doing something wrong, but no clue what :) [08:43:22] matanya, let's check that it actually works at all - can you copy the very last demo example and at least see that it shows up in preview? [08:44:00] if the graph subsystem is not loading, no point in debugging complex graphs [08:45:08] yurik: that works [08:45:14] https://he.wikipedia.org/wiki/%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9:Matanya/%D7%90%D7%A8%D7%92%D7%96_%D7%97%D7%95%D7%9C [08:46:47] matanya, good, that means the problem is with the graph. looking. 
[08:48:14] matanya, i know what's wrong [08:48:20] yes ? [08:48:29] you named it Rawdata, but call it as RawData [08:48:35] casing ) [08:49:34] (03PS1) 10Filippo Giunchedi: role::cache: decommission statsite [puppet] - 10https://gerrit.wikimedia.org/r/209188 (https://phabricator.wikimedia.org/T95687) [08:53:07] 6operations, 7Graphite, 5Patch-For-Review: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1264072 (10fgiunchedi) [08:53:09] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1264071 (10fgiunchedi) [08:54:49] Thanks yurik ! [08:54:58] matanya, works? [08:55:03] yes :) [08:55:26] excelente ) [08:55:29] (03PS4) 10Filippo Giunchedi: graphite: split alerts role [puppet] - 10https://gerrit.wikimedia.org/r/208083 (https://phabricator.wikimedia.org/T97754) [08:56:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: split alerts role [puppet] - 10https://gerrit.wikimedia.org/r/208083 (https://phabricator.wikimedia.org/T97754) (owner: 10Filippo Giunchedi) [09:04:11] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1264104 (10Mjbmr) @Nemo_bis @Glaisher I don't see any reason why you guys are waiting. [09:10:18] yurik: one ,more question: why i change the highlights ID's to something else ,e.g 250 to 249 no other country changes color. is this part of the definition too ? 
[09:12:13] (03PS4) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [09:12:15] matanya, check the map data - in reality it would be better to use map data with country codes instead of numbers - makes it much easier - i simply used the first map i found on the web (seemed ok license-vise, but even that should be rechecked) [09:12:15] thanks [09:13:27] matanya, those numbers btw are standard too, just not as well known as "us", "he", etc [09:13:28] didn't know that [09:14:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:17:56] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [09:18:06] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [09:24:36] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [09:26:49] matanya: videoscalers... hmm that would be Brian, but he is anyway blocked on an upgrade to trusty and HHVM for them so you will pretty much circle back to ops. [09:27:40] thanks akosiaris. i'd say there status is suboptimal, to say the least. [09:33:43] <_joe_> akosiaris: I beg to disagree [09:34:10] _joe_: ? [09:34:25] <_joe_> I mean, the multimedia team has been dismantled, I had basically no support (apart from volunteers, and a few employees who acted as volunteers) [09:34:36] <_joe_> in converting the imagescalers and the videoscalers. 
[09:34:57] <_joe_> the WMF has currently no plan in working on those, and I want the org to know there are issues [09:35:13] <_joe_> before I struggle to fix them while pursuing other goals [09:35:52] <_joe_> we're focusing our strategies around a few big goals per quarter, and that's fine, but that means making tradeoffs [09:36:35] <_joe_> and I want the WMF to get feedback about those tradeoffs [09:36:36] so you don't disagree per se, you complete my sentence that matanya will circle back to ops, to the multimedia team, to a third party, to a black hole and a new quarter and new goals before he actually gets any results [09:36:48] <_joe_> so my response with the WMF hat is there is exactly 0 resources to work on either the imagescalers or the videoscalers [09:37:06] <_joe_> so I told matanya yesterday to speak with someone in product [09:37:16] i tried :) [09:37:25] <_joe_> I imagined that [09:37:25] product ? [09:37:51] <_joe_> akosiaris: yeah, matanya is part of the community, product should listen to their needs right? [09:37:54] <_joe_> :) [09:38:30] no, my question is, do we got a product team after the reorg ? [09:38:33] <_joe_> so that work on commons/uploadwizard/multimedia in general gets allocated to someone [09:38:41] obviously matanya should be listened to [09:38:42] <_joe_> akosiaris: product people, though :) [09:38:59] <_joe_> I was speaking of the functional role [09:39:35] godog: this one is for you: APIMWException: internal_api_error_UploadChunkFileException: [89c559d8] Exception Caught: Error storing file in '/tmp/07gxJR': backend-fail-internal; local-swift-eqiad [09:40:23] heh, and which ppl fill that functional role of product these days ? I honestly am not sure how to find out [09:41:11] akosiaris and _joe_ i thank you for listening to my complaints, it is not directed to ops, just some frustration regarding the multimedia status, where i find myself hoolahooping to get some stuff done. 
[09:41:21] matanya: can you give a little more context btw? [09:41:34] godog: uploaded a file, this is all i know [09:42:42] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1264182 (10Muehlenhoff) Yeah, I agree it makes sense to strip email addresses. Let's retain an unmodified copy in case someone really needs to track down/contact a bugreporter on a ca... [09:42:46] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0] [09:42:53] <_joe_> matanya: FWIW, I agree with you completely [09:42:56] matanya: yeah, point taken. I think it's good that you pose those questions and somehow we are not sure how to answer them right now [09:43:04] <_joe_> oh FFS labstore again [09:50:56] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [09:55:55] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0] [10:00:42] 6operations, 5Patch-For-Review: Remove Erik Moeller's Production Shell Access - https://phabricator.wikimedia.org/T97864#1264283 (10Krenair) 5Resolved>3Invalid Well, this request itself is invalid then. UID 503... [10:05:32] starting the salt upgrade train on production cluster now. first stop: upgrade master/minion on palladium (salt master) [10:07:32] <_joe_> apergos: yay [10:09:15] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [10:10:41] survived that, all hosts responsive but 3 that weren't anyways (two known down + dysprosium) [10:10:52] on to precise hosts, except for tin and virt* [10:14:37] (03PS5) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [10:19:37] <_joe_> oh this worked flawlessly in labs ^^, although I didn't test SSL for now [10:19:40] <_joe_> \o/ [10:25:27] nice!! 
[10:26:59] yurik: can i add text legends to graphs ? [10:29:36] <_joe_> ah! a small error, though [10:32:17] (03PS6) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [10:49:34] Can someone in ops take a look at error.log on silver? [10:49:45] (/var/log/apache2/error.log) [10:53:47] 6operations: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1264387 (10akosiaris) So, for this task the answer is precise. We know trusty or jessie will not work. We got T98129 for investigating ruby > 1.8 which is trusty/jessie. Ruby 1.9 supports pr... [10:56:37] PROBLEM - puppet last run on mw2180 is CRITICAL puppet fail [10:57:26] Krenair: client denied by server configuration [10:57:49] any signs of a php error there? [10:57:53] nope [10:58:12] I wonder what https://wikitech.wikimedia.org/wiki/Special:Properties is causing behind the scenes [10:58:16] PROBLEM - puppet last run on analytics1003 is CRITICAL Puppet has 1 failures [10:58:32] Krenair: were you able to get the file from my upload request ? [10:58:32] now there is [10:58:34] PHP Fatal error: Call to undefined function wfViewPrevNext() in /srv/mediawiki/php-1.26wmf4/extensions/SemanticMediaWiki/specials/QueryPages/SMW_QueryPage.php on line 76 [10:58:39] aha [10:58:52] someone must have tried again [10:58:57] (03PS1) 10Alexandros Kosiaris: rhodium: precise as installation distro [puppet] - 10https://gerrit.wikimedia.org/r/209201 [10:59:06] thanks apergos [10:59:08] yw [10:59:08] I did try again [10:59:18] ah that would be it then :-) [10:59:28] matanya: [11:00:02] yes Krenair ? 
[11:00:10] krenair@terbium:~$ host encoding01.eqiad.wmflabs [11:00:10] Host encoding01.eqiad.wmflabs not found: 3(NXDOMAIN) [11:00:16] unsurprisingly [11:00:19] a darn [11:00:32] I don't think I could ssh to that anyway [11:01:13] i wonder how hoo did it yesterday [11:05:21] (03PS7) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [11:09:59] matanya, sure, see vega for example on how to do legends [11:10:02] matanya, I think it might be possible via a ssh config on terbium [11:10:13] and probably forwarding your labs key to prod :( [11:10:22] (03PS8) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [11:10:43] yurik: thanks, i looked at that, and didn't find how to add text legend, just style. [11:10:50] Although I don't seem to be able to ping bastion.wmflabs.org from terbium [11:10:55] So maybe not [11:10:55] Krenair: proxycommand won't work ? [11:11:59] matanya, see the demo page - two of the graphs use legends - https://www.mediawiki.org/wiki/Extension:Graph/Demo [11:13:01] yurik: thanks, didn't figure it was in the rawdata page [11:13:12] i was asking about simple legends [11:13:24] ?? 
[11:14:29] yurik: https://he.wikipedia.org/wiki/%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9:Matanya/%D7%90%D7%A8%D7%92%D7%96_%D7%97%D7%95%D7%9C in the left graph, i want to add a title: death of accidents between 1949-2015 in Israel [11:14:45] RECOVERY - puppet last run on analytics1003 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:15:42] (03CR) 10Alexandros Kosiaris: [C: 032] rhodium: precise as installation distro [puppet] - 10https://gerrit.wikimedia.org/r/209201 (owner: 10Alexandros Kosiaris) [11:16:04] (03PS1) 10KartikMistry: CX: Add languages for CX deployment on 20150507 [puppet] - 10https://gerrit.wikimedia.org/r/209202 (https://phabricator.wikimedia.org/T97888) [11:16:26] RECOVERY - puppet last run on mw2180 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:16:46] (03CR) 10KartikMistry: [C: 04-1] "To be deployed on 20150407." [puppet] - 10https://gerrit.wikimedia.org/r/209202 (https://phabricator.wikimedia.org/T97888) (owner: 10KartikMistry) [11:17:06] err. 20150507 :/ [11:17:31] matanya, sorry, still don't get it - are you trying to show a legend on the left/right of the graph, with the list of what color corresponds to what meaning? Or a title of the graph above it? Both should be possible [11:18:28] I tried both. the legend I sort of got working, but not the title. [11:22:24] FYI - mobile site appears broken - https://phabricator.wikimedia.org/T98309 [11:22:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [11:22:28] 6operations, 6Mobile-Web, 7Mobile: Mobile site broken - https://phabricator.wikimedia.org/T98309#1264413 (10Krenair) p:5Triage>3Unbreak! 
I noticed something like this earlier as well [11:22:32] 6operations, 6Mobile-Web, 7Mobile: Mobile site broken - https://phabricator.wikimedia.org/T98309#1264416 (10Krenair) [11:22:35] oh, the bot's just being slow, ok [11:25:22] (03PS1) 10Alexandros Kosiaris: Remove unused ganglia.wikimedia.org.erb template [puppet] - 10https://gerrit.wikimedia.org/r/209204 [11:25:24] (03PS1) 10Alexandros Kosiaris: ganglia.wikimedia.org: Specify position of access log [puppet] - 10https://gerrit.wikimedia.org/r/209205 [11:27:48] Font from origin 'https://phab.wmfusercontent.org' has been blocked from loading by Cross-Origin Resource Sharing policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'https://phabricator.wikimedia.org' is therefore not allowed access. [11:27:49] heh [11:28:27] 6operations, 6Mobile-Web, 7Mobile: Mobile site broken - https://phabricator.wikimedia.org/T98309#1264435 (10Krenair) {F161156} [11:28:37] github is down [11:28:41] too bad [11:29:53] yurik: [11:29:55] their status says all systems operational, heh [11:30:02] Krenair: see the graph [11:30:14] yeah [11:30:24] matanya, sorry, i can't fix github :) [11:30:25] doesn't quite match their green "All systems operational" box at the top though does it? [11:30:49] yurik: that was for sending me to read the docs :D [11:31:05] not quite [11:31:26] PROBLEM - puppet last run on labnet1001 is CRITICAL Puppet has 1 failures [11:31:28] yep status page is picking up latency now [11:31:45] PROBLEM - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0] [11:31:46] precise hosts done, lunch break before moving on to salt upgrade of trusty hosts [11:31:46] (03PS1) 10Faidon Liambotis: varnish: quick fix for mobile redirect outage [puppet] - 10https://gerrit.wikimedia.org/r/209206 (https://phabricator.wikimedia.org/T98309) [11:31:50] bblack: are you here? 
[11:32:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:32:23] <_joe_> paravoid: he was here untli 3.5 hours ago, I hope not [11:32:53] (03CR) 10Faidon Liambotis: [C: 032] varnish: quick fix for mobile redirect outage [puppet] - 10https://gerrit.wikimedia.org/r/209206 (https://phabricator.wikimedia.org/T98309) (owner: 10Faidon Liambotis) [11:33:35] <_joe_> paravoid: LGTM too :) [11:33:53] heh, funky status codes [11:33:54] 6operations, 6Mobile-Web, 7Mobile, 5Patch-For-Review: Mobile site broken - https://phabricator.wikimedia.org/T98309#1264451 (10Krenair) [11:34:13] HTTP/999 Written In PHP [11:34:56] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [11:37:53] c'moon puppet [11:39:56] PROBLEM - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0] [11:40:25] PROBLEM - puppet last run on terbium is CRITICAL Puppet has 1 failures [11:43:26] (03PS1) 10KartikMistry: CX: Enable ContentTranslation for Wikis scheduled on 20150507 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209207 (https://phabricator.wikimedia.org/T97888) [11:43:27] (03CR) 10Alexandros Kosiaris: [C: 032] Remove unused ganglia.wikimedia.org.erb template [puppet] - 10https://gerrit.wikimedia.org/r/209204 (owner: 10Alexandros Kosiaris) [11:43:27] AaronSchulz: can you take a look at https://gerrit.wikimedia.org/r/#/c/207785/1 ? [11:43:27] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia.wikimedia.org: Specify position of access log [puppet] - 10https://gerrit.wikimedia.org/r/209205 (owner: 10Alexandros Kosiaris) [11:43:41] 6operations, 6Mobile-Web, 7Mobile, 5Patch-For-Review: Mobile site broken - https://phabricator.wikimedia.org/T98309#1264465 (10faidon) Thanks for the report and thank you @Krenair for bringing it to our attention. This was broken with 4eb918924cbadb98d0e0e144e97d35376abfbbb8 at approximately 8:00 UTC (10:... 
[11:44:12] 6operations, 10Traffic, 7Mobile: Mobile site broken - https://phabricator.wikimedia.org/T98309#1264466 (10faidon) 5Open>3Resolved a:3faidon [11:44:31] 6operations, 10Traffic, 7Mobile: Mobile site broken - https://phabricator.wikimedia.org/T98309#1264471 (10Krenair) Jimmy's article on enwiki now works on my phone [11:47:32] apergos, would you mind taking a look at that log again for me? [11:47:41] (silver:/var/log/apache2/error.log) [11:47:56] RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:48:03] speaking of apergos [11:48:16] apergos: snapshot1004 is sick, bd808 was trying to figure it out last night [11:48:18] Krenair: mind running your whatever it is again? [11:48:25] paravoid: yep I saw that today [11:48:31] memory [11:48:42] I hit refresh on https://wikitech.wikimedia.org/wiki/Special:Properties a few times, apergos [11:48:45] I'll be looking at it later after today's salt upgrae [11:49:06] PHP Catchable fatal error: Argument 1 passed to Language::viewPrevNext() must be an instance of Title, string given, called in /srv/mediawiki/php-1.26wmf4/extensions/SemanticMediaWiki/specials/QueryPages/SMW_QueryPage.php on line 82 and defined in /srv/mediawiki/php-1.26wmf4/languages/Language.php on line 4706 [11:49:07] SMW was using a global MW function we removed from MediaWiki almost a year ago [11:49:12] ah, oops. ok [11:51:07] 6operations, 10ops-eqiad: db1027 raid degraded - https://phabricator.wikimedia.org/T98285#1264488 (10Cmjohnson) Replaced the disk and rebuilding Slot Number: 10 Firmware state: Rebuild [11:51:46] and now apergos? [11:52:10] PHP Catchable fatal error: Argument 4 passed to Language::viewPrevNext() must be of the type array, string given, called in /srv/mediawiki/php-1.26wmf4/extensions/SemanticMediaWiki/specials/QueryPages/SMW_QueryPage.php on line 82 and defined in /srv/mediawiki/php-1.26wmf4/languages/Language.php on line 4707 [11:52:16] whack-a-mole? 
:-) [11:52:22] yeah, I can't test this stuff locally [11:52:25] I wonder why this isn't going to fluorine [11:53:36] PROBLEM - puppet last run on mw1041 is CRITICAL Puppet last ran 14 hours ago [11:53:40] fixed it [11:53:44] thanks for your help apergos [11:53:50] sweet! [11:54:03] will commit the fix and upload [11:55:05] RECOVERY - puppet last run on terbium is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [11:55:15] RECOVERY - puppet last run on mw1041 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:56:26] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [11:59:05] bah, it doesn't merge of course [12:02:08] lol git-review doesn't even work for that [12:04:46] Krenair: for the font, the server has to serve an header like 'Access-Control-Allow-Origin *' or 'Access-Control-Allow-Origin https://phabricator.wikimedia.org'. [12:19:50] (03PS2) 10Alexandros Kosiaris: hieraize nrpe [puppet] - 10https://gerrit.wikimedia.org/r/208630 [12:27:22] RECOVERY - RAID on db1027 is OK optimal, 1 logical, 2 physical [12:31:15] (03CR) 10Alexandros Kosiaris: "ping ? pong ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/204155 (owner: 10Alexandros Kosiaris) [12:34:52] (03PS3) 10Alexandros Kosiaris: hieraize nrpe [puppet] - 10https://gerrit.wikimedia.org/r/208630 [12:43:02] RECOVERY - puppet last run on analytics1016 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:45:25] !log krenair Synchronized php-1.26wmf4/extensions/SemanticMediaWiki/specials/QueryPages/SMW_QueryPage.php: https://gerrit.wikimedia.org/r/#/c/209212/ (duration: 00m 21s) [12:45:35] Logged the message, Master [12:48:00] (03PS1) 10Alex Monk: Update my gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/209214 [12:48:09] (03PS1) 10Alexandros Kosiaris: Assign role::puppetmaster::backend to rhodium [puppet] - 10https://gerrit.wikimedia.org/r/209215 (https://phabricator.wikimedia.org/T98173) [12:50:47] (03CR) 10Alexandros Kosiaris: [C: 032] Assign role::puppetmaster::backend to rhodium [puppet] - 10https://gerrit.wikimedia.org/r/209215 (https://phabricator.wikimedia.org/T98173) (owner: 10Alexandros Kosiaris) [12:53:51] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1264702 (10daniel) @Jimkont: broken serialization of empty lists is a separate issue, unrelated to unconverted old-st... [12:55:43] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:58:04] joal: paravoid and cmjohnson1 are going to coordinate switch replacement here [12:58:13] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1264706 (10daniel) I'm now running the following on tool labs to find "old" serializations: daniel@tools-bastion-0... 
[12:58:21] 6operations, 10ops-eqiad: Remove extra unused SSD from rhodium - https://phabricator.wikimedia.org/T98323#1264707 (10akosiaris) 3NEW [12:58:51] PROBLEM - puppet last run on rhodium is CRITICAL Puppet has 5 failures [13:00:05] paravoid, cmjohnson1: Respected human, time to deploy Switch Maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T1300). Please do the needful. [13:00:09] haha [13:00:19] cmjohnson1, give me a sec [13:00:27] haha [13:00:49] :) [13:01:19] ok, let's go ahead [13:01:27] !log replacing asw-c4-eqiad (T93730) [13:01:33] Logged the message, Master [13:01:44] cmjohnson1: power off C4 & unplug it :) [13:01:48] going to power down the switch [13:03:47] PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100% [13:03:47] PROBLEM - Host rcs1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:47] PROBLEM - Host rdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:47] PROBLEM - Host stat1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:04:58] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:17] PROBLEM - Host osmium is DOWN: PING CRITICAL - Packet loss = 100% [13:05:17] PROBLEM - Host eventlog1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:18] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:18] PROBLEM - Host gadolinium is DOWN: PING CRITICAL - Packet loss = 100% [13:05:18] PROBLEM - Host caesium is DOWN: PING CRITICAL - Packet loss = 100% [13:05:18] PROBLEM - Host erbium is DOWN: PING CRITICAL - Packet loss = 100% [13:05:18] PROBLEM - Host labsdb1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:19] PROBLEM - Host ganeti1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:19] PROBLEM - Host graphite1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:27] PROBLEM - Host lead is DOWN: PING CRITICAL - Packet loss = 100% [13:05:28] PROBLEM - Host hafnium is DOWN: PING CRITICAL - Packet loss = 100% [13:05:28] PROBLEM - 
Host logstash1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:28] PROBLEM - Host labsdb1007 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:28] PROBLEM - Host ganeti1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:28] PROBLEM - Host logstash1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:06:07] cmjohnson1: I just committed the FPC 4 serial number replacement on the rest of the asw-c stack [13:06:14] ("set virtual-chassis member 4 serial-number BP0211500170") [13:07:51] <_joe_> jobrunners stopped processing jobs, I don't see other flags for now [13:08:45] <_joe_> and phab is down obviously [13:14:59] paravoid: does it matter which vcp port? [13:15:05] no [13:15:06] or any 1 [13:15:07] okay [13:17:47] paravoid: plugged in, console connected, mgmt connected [13:17:55] 1 vcp [13:18:00] power it on [13:18:06] no servers yet [13:18:09] oh powered on alrady [13:18:18] yeah....once plugged in [13:18:21] it's powered on [13:19:08] everything looks good [13:19:14] start plugging servers [13:19:32] k [13:19:51] and the second VCP, but servers first to minimize downtime :) [13:20:08] RECOVERY - Host eventlog1001 is UPING OK - Packet loss = 0%, RTA = 3.51 ms [13:20:17] RECOVERY - Host logstash1001 is UPING OK - Packet loss = 0%, RTA = 1.01 ms [13:20:38] RECOVERY - Host logstash1003 is UPING OK - Packet loss = 0%, RTA = 1.51 ms [13:20:47] RECOVERY - Host hafnium is UPING OK - Packet loss = 0%, RTA = 1.34 ms [13:20:48] RECOVERY - Host logstash1002 is UPING OK - Packet loss = 0%, RTA = 1.50 ms [13:20:58] RECOVERY - Host rcs1001 is UPING OK - Packet loss = 0%, RTA = 1.71 ms [13:20:58] RECOVERY - Host graphite1001 is UPING OK - Packet loss = 0%, RTA = 1.38 ms [13:21:18] RECOVERY - Host caesium is UPING OK - Packet loss = 0%, RTA = 1.48 ms [13:21:38] RECOVERY - Host rdb1001 is UPING OK - Packet loss = 0%, RTA = 1.47 ms [13:22:07] RECOVERY - Host gadolinium is UPING OK - Packet loss = 0%, RTA = 0.47 ms [13:22:18] PROBLEM - Varnishkafka Delivery Errors per minute 
on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:22:18] RECOVERY - Host erbium is UPING OK - Packet loss = 0%, RTA = 0.67 ms [13:22:28] RECOVERY - Host analytics1001 is UPING OK - Packet loss = 0%, RTA = 0.71 ms [13:22:44] cool, back. joal, looking at running jobs on an01 now [13:22:50] cool! [13:22:57] RECOVERY - Host stat1003 is UPING OK - Packet loss = 0%, RTA = 2.36 ms [13:23:00] so, it looks like whatever it had running before the thing went down, it just restarted [13:23:01] that was like being at the dc and watching cmjohnson1 plugging cables back in [13:23:07] RECOVERY - Host osmium is UPING OK - Packet loss = 0%, RTA = 3.05 ms [13:23:07] RECOVERY - Host ganeti1001 is UPING OK - Packet loss = 0%, RTA = 1.58 ms [13:23:15] godog: haha yes [13:23:17] RECOVERY - Host ganeti1002 is UPING OK - Packet loss = 0%, RTA = 5.12 ms [13:23:18] RECOVERY - Host iridium is UPING OK - Packet loss = 0%, RTA = 2.41 ms [13:23:18] RECOVERY - Host lead is UPING WARNING - Packet loss = 44%, RTA = 1.74 ms [13:23:22] I was thinking the same [13:23:38] RECOVERY - Host labsdb1006 is UPING OK - Packet loss = 0%, RTA = 1.08 ms [13:23:48] RECOVERY - Host labsdb1007 is UPING OK - Packet loss = 0%, RTA = 2.35 ms [13:23:59] hehe, too bad timestamps on my irc clients don't have seconds [13:24:08] paravoid: everything is plugged back in [13:24:28] PROBLEM - puppet last run on logstash1002 is CRITICAL puppet fail [13:24:28] going to add the 2nd vcp now [13:24:44] great [13:24:45] eventlogging looks cool now too [13:24:49] thanks guys that was pretty painless [13:24:50] icinga has only 979 active alerts [13:25:02] I hate our per-server graphite checks so much :) [13:25:09] PROBLEM - puppet last run on rdb1001 is CRITICAL puppet fail [13:25:18] PROBLEM - puppet last run on graphite1001 is CRITICAL puppet fail [13:25:28] PROBLEM - puppet last run on hafnium is CRITICAL puppet fail [13:25:39] PROBLEM - puppet last run on rcs1001 is CRITICAL puppet fail [13:25:41] 
cmjohnson1: ok, I see the 2nd VCP connected now [13:25:44] 4 (FPC 4) Prsnt BP0211500170 ex4200-48t 0 Linecard Y 6 vcp-0 2 vcp-1 [13:25:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:25:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:25:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:25:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:25:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:26:07] PROBLEM - puppet last run on analytics1001 is CRITICAL puppet fail [13:26:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:26:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:26:34] Hadoop Resource Manager is back, jobs seeme not to have suffer :0 [13:26:37] Thanks guys ! [13:26:38] PROBLEM - puppet last run on gadolinium is CRITICAL puppet fail [13:26:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:26:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:26:59] PROBLEM - puppet last run on ganeti1001 is CRITICAL puppet fail [13:27:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [13:27:08] PROBLEM - puppet last run on iridium is CRITICAL puppet fail [13:27:34] vk drerr due to destination on c4? 
[13:28:15] likely due to missing data in graphite [13:28:24] that ^ [13:28:27] graphite was on c4 [13:28:35] s/was/is/ :) [13:29:18] cmjohnson1: so, are we done? [13:29:21] (I am :) [13:29:34] paravoid we are done [13:29:43] just updating racktables now [13:29:49] \o/ thanks paravoid cmjohnson1 [13:29:52] awesome [13:29:54] thanks :) [13:30:22] yw...thx for the prep work [13:31:38] <_joe_> the jobrunners have recovered without the need for a restart. kudos to aaron [13:32:03] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1264732 (10faidon) 5Open>3Resolved Switch replacement is done. End-to-end downtime for the affected hosts was ~20mins, starting at 13:01 UTC, well within our window. [13:33:44] 6operations, 7Monitoring: Upgrade to newer version of gdash - https://phabricator.wikimedia.org/T98134#1264745 (10fgiunchedi) a:3fgiunchedi I'll look into upgrading gdash also a candidate for replacement: https://github.com/urbanairship/tessera [13:34:29] (03Draft2) 10Filippo Giunchedi: icinga: unify swift alerts [puppet] - 10https://gerrit.wikimedia.org/r/209217 [13:35:19] (03PS1) 10Andrew Bogott: Rename virt1010 to labvirt1009 [dns] - 10https://gerrit.wikimedia.org/r/209219 [13:35:21] (03PS1) 10Andrew Bogott: Rename virt1010 to labvirt1009 [puppet] - 10https://gerrit.wikimedia.org/r/209220 [13:35:39] godog: there is an abundance of graphite frontend and I'm all for exploring more of them (as long as we pick 1-2, not install all of them -- we already have 2 :) [13:36:03] godog: but gdash has been useful for us so far, so an incremental update there could give a lot of value without much effort [13:36:27] (03CR) 10Andrew Bogott: [C: 032] Rename virt1010 to labvirt1009 [dns] - 10https://gerrit.wikimedia.org/r/209219 (owner: 10Andrew Bogott) [13:37:04] paravoid: indeed, I'll tackle upgrading gdash first, what's the second frontend btw? 
[13:37:16] grafana [13:37:18] 6operations, 10ops-eqiad: db1027 raid degraded - https://phabricator.wikimedia.org/T98285#1264749 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Back online spun up [13:38:00] doh, I keep forgetting about it -- it doesn't have dashboards in configuration, that's probably why [13:38:24] 6operations, 10ops-eqiad: Remove extra unused SSD from rhodium - https://phabricator.wikimedia.org/T98323#1264753 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson done [13:38:26] <_joe_> !log uploading HHVM 3.6.1 and all the related extensions to apt.wikimedia.org [13:38:32] Logged the message, Master [13:38:39] (03CR) 10Andrew Bogott: [C: 032] Rename virt1010 to labvirt1009 [puppet] - 10https://gerrit.wikimedia.org/r/209220 (owner: 10Andrew Bogott) [13:38:49] _joe_: \o/ [13:39:03] <_joe_> akosiaris: wait for the rollback [13:39:04] <_joe_> :P [13:39:05] great [13:39:07] nice job cmjohnson1 [13:39:09] lol [13:39:09] RECOVERY - puppet last run on labsdb1007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:39:57] apergos: are you upgrading salt everywhere? [13:40:07] akosiaris: could you follow up on https://phabricator.wikimedia.org/T96924 before you punch out for the night? Some Openstack folks think there’s a feature to allow the routing we need. [13:40:15] yes. well not labs, and not tin today [13:40:18] paravoid: [13:40:25] you should !log that [13:40:28] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [13:40:30] I got quite scared for a moment there :) [13:40:55] because I saw all mw* hosts momentarily pop up with a dpkg failure, indicative of a package upgrade [13:40:59] ah [13:41:10] ...right after _joe_ logged putting HHVM 3.6 to apt [13:41:14] I'm doing them in batched [13:41:16] batches [13:41:23] <_joe_> paravoid: oh man [13:41:26] so you'll see more of those [13:41:37] then !log it please? 
[13:42:04] <_joe_> also I just found out someone bumped the version of hhvm-tidy without committing it to the repo, so now we have broken hhvm packages [13:42:12] twentyafterfour: search is working today on wikitech. Thanks again for fixing. [13:42:13] !log all precise hosts are upgraded to salt except for tin and virt1000; in the middle of trusty updates now, in batches [13:42:18] Logged the message, Master [13:42:21] <_joe_> I'm going to rebuild it in 10 minutes or so [13:42:34] <_joe_> but no one should operate on the appservers right now [13:42:47] (03PS1) 10Springle: repool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209221 [13:42:55] andrewbogott: yeah, I've seen it. I 'll followup as soon as I am done with a couple of other things [13:42:57] ok, I'll do the others, please holler when I can go back to those [13:43:02] _joe_: [13:43:10] akosiaris: thank you! [13:43:47] PROBLEM - puppet last run on mw1083 is CRITICAL Puppet has 1 failures [13:43:48] <_joe_> ahah labs is frozen now [13:43:55] <_joe_> so yeah, more than 10 minutes I guess [13:44:08] fyi I have upated salt on mw10*, I have apt-get update on mw11* and the rest of the mw* are untouched by me [13:44:08] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 64.4966098684 [13:44:35] <_joe_> apergos: yeah please leave them alone for now [13:44:37] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:40] right [13:44:50] <_joe_> what's happening? ^^ [13:45:08] <_joe_> paravoid: packet loss? [13:45:37] RECOVERY - puppet last run on db2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:46:08] (03CR) 10Springle: [C: 032] repool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209221 (owner: 10Springle) [13:46:14] (03Merged) 10jenkins-bot: repool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209221 (owner: 10Springle) [13:46:30] ? 
[13:46:48] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 0.852770131579 [13:47:39] !log springle Synchronized wmf-config/db-eqiad.php: repool db1021 in s2, warm up (duration: 00m 27s) [13:47:46] Logged the message, Master [13:49:47] RECOVERY - Host virt1010 is UPING OK - Packet loss = 0%, RTA = 1.78 ms [13:50:43] (03PS3) 10Filippo Giunchedi: icinga: unify swift alerts [puppet] - 10https://gerrit.wikimedia.org/r/209217 (https://phabricator.wikimedia.org/T88974) [13:53:29] <_joe_> !log upgrading the hhvm imagescaler (mw1152) to HHVM 3.6.1 [13:53:29] PROBLEM - configured eth on virt1010 is CRITICAL: Timeout while attempting connection [13:53:33] Logged the message, Master [13:54:49] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:55:06] Sorry about the virt1010 warnings, that’s me doing things in the wrong order [13:55:18] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:37] ACKNOWLEDGEMENT - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). alexandros kosiaris service still being onlined [13:59:07] (03PS6) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [13:59:33] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [14:00:02] RECOVERY - puppet last run on mw1083 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:00:19] chasemp: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T1400). 
[14:01:50] (03PS5) 10Rush: phab stage tags for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/205723 [14:01:58] (03CR) 10Rush: [V: 032] phab stage tags for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [14:05:33] (03PS7) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [14:05:33] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:06:09] Does anyone know how to force ganglia to reload hostnames? I’ve renamed some servers and it’s still showing stats under the old hostnames. [14:06:21] I can restart the service but I don’t want to erase everyone’s historical data... [14:07:26] Thanks for caring about that :) [14:08:32] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [14:08:52] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [14:10:12] PROBLEM - check_puppetrun on payments1002 is CRITICAL Puppet has 2 failures [14:10:13] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:11:35] fixing ^^^ [14:11:36] !log rebooting labvirt1009 one last time [14:11:57] (03PS8) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [14:12:42] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [14:12:52] mark, around? CC: MaxSem [14:13:26] robh CC [14:13:29] (03PS4) 10Ottomata: Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 [14:14:07] (03CR) 10jenkins-bot: [V: 04-1] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata) [14:14:31] yes? 
[14:15:12] RECOVERY - check_puppetrun on payments1002 is OK Puppet is currently enabled, last run 251 seconds ago with 0 failures [14:16:02] uh-oh yurik & mark, the ticket we were going to discuss can't be viewed due to phab upgrade:P [14:16:20] !log gracefuled apache on uranium [14:17:14] (03CR) 10Ottomata: [C: 032 V: 032] Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 (owner: 10Ottomata) [14:17:53] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [14:17:54] <_joe_> !log pooling the HHVM imagescalers to test if the issues are solved now. [14:18:00] Logged the message, Master [14:18:02] (03CR) 10Ottomata: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata) [14:18:02] PROBLEM - check if phabricator taskmaster is running on iridium is CRITICAL: PROCS CRITICAL: 1 process with regex args PhabricatorTaskmasterDaemon [14:18:06] (03PS5) 10Ottomata: Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 [14:18:25] mark, in short - MaxSem and I would like to get our hands on some test hardware, instead of a VM, due to us needing of a very large DB (1TB), and descent performance. It would be great if we can use one of the spares (we saw a few good ones in the wikitech). It would be great if this box behaved like it was part of the labs cluster, but we could even use it without any external exposure via an ssh tunnel [14:18:44] ouch, phabricator down again [14:18:45] this is purelly for our own experimentation, not a production [14:19:22] mark, we have filed a phab ticket about it, about a week ago. 
Yesterday robh came back to tell us that we need to discuss it with you [14:19:36] right [14:19:38] thus we are :) [14:19:45] first off, we can't make that behave like labs [14:19:53] we don't support normal hardware in labs yet unfortunately [14:20:04] tunnel is fine then [14:20:15] (03CR) 10Ottomata: [C: 032] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata) [14:20:25] and we should probably meet up soon to discuss needs and get you someone to help out from ops [14:20:53] !log phabricator went down again for some minutes, seems ok now? [14:20:59] Logged the message, Master [14:21:22] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [14:22:11] (03CR) 10Anomie: [C: 031] "Looks sane." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209170 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [14:22:31] <_joe_> that was the HHVM imagescaler I suppose [14:22:35] mark, that is always great. we already have some functioning components, and the stack is starting to shape up, so the discussion will be more productive than simply "we need N boxes, but we have no idea how to use them yet" :) [14:22:40] <_joe_> !log depooling the HHVM imagescaler [14:22:45] Logged the message, Master [14:23:41] good [14:24:08] mark, mostly we need the big machine so that we can estimate our postgress queries on the full data storage (about 400GB). Of course in production it has to be SSD, so it wont be as accurate, but we will get some measures [14:24:44] and run some vector tile generation (SQL->vector) [14:25:34] perhaps we have spare SSDs as well [14:25:45] akosiaris, btw, should i file a phab request for the OSM db to come back online? [14:25:52] mark, are they big enough? [14:26:18] for production, we are likely to need 1TB SSDs [14:26:21] per machine [14:26:34] are there details about that on the ticket? 
[14:28:00] mark, details about potential production hardware needs were part of the hardware budgeting. The ticket for the test spare machine is much simpler - as long as it has enough space, we should be ok to get somem useful data from it [14:28:03] phab's back to live btw: https://phabricator.wikimedia.org/T97638 [14:30:24] for me, the most important part is being able to work with the full-sized DB, to optimize queries and analyze the full dataset [14:32:54] !log shutting down db1054 for maintenance [14:33:00] Logged the message, Master [14:33:27] shall we discuss this in a meeting tomorrow? [14:33:34] around this time perhaps? [14:34:18] mark, sure [14:36:43] PROBLEM - Host db1054 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:44] ok [14:36:44] mark, sounds good [14:36:44] mark, who else should be invited? [14:36:44] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:36:44] just sent an invite [14:36:45] brb [14:37:22] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [14:39:12] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:40:44] (03PS1) 10Rush: Phab: start-taskmasters renamed to taskmasters [puppet] - 10https://gerrit.wikimedia.org/r/209228 [14:41:08] (03PS2) 10Rush: Phab: start-taskmasters renamed to taskmasters [puppet] - 10https://gerrit.wikimedia.org/r/209228 [14:42:39] (03CR) 10Rush: [C: 032 V: 032] Phab: start-taskmasters renamed to taskmasters [puppet] - 10https://gerrit.wikimedia.org/r/209228 (owner: 10Rush) [14:46:47] (03PS1) 10Rush: Phab: allow no outbound requests syntax change [puppet] - 10https://gerrit.wikimedia.org/r/209230 [14:47:00] (03PS2) 10Rush: Phab: allow no outbound requests syntax change [puppet] - 10https://gerrit.wikimedia.org/r/209230 [14:47:09] (03CR) 10Rush: [C: 032 V: 032] Phab: allow no outbound requests syntax change [puppet] - 
10https://gerrit.wikimedia.org/r/209230 (owner: 10Rush) [14:49:20] (03PS1) 10Rush: phab: security outbound try as explicit array [puppet] - 10https://gerrit.wikimedia.org/r/209231 [14:49:24] (03CR) 10jenkins-bot: [V: 04-1] phab: security outbound try as explicit array [puppet] - 10https://gerrit.wikimedia.org/r/209231 (owner: 10Rush) [14:49:29] (03PS2) 10Rush: phab: security outbound try as explicit array [puppet] - 10https://gerrit.wikimedia.org/r/209231 [14:50:18] (03CR) 10Rush: [C: 032] phab: security outbound try as explicit array [puppet] - 10https://gerrit.wikimedia.org/r/209231 (owner: 10Rush) [14:50:31] bd808: Ping for SWAT. Want to do it yourself? [14:50:52] If I'm still the only one in there, sure [14:51:29] anomie: yeah I'll take it [14:52:18] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1264953 (10akosiaris) [14:53:02] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [14:53:02] ACKNOWLEDGEMENT - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). alexandros kosiaris still bringing up the service [14:54:00] (03PS1) 10Rush: phab: metamta.maniphest.reply-handler-domain obsolete option [puppet] - 10https://gerrit.wikimedia.org/r/209232 [14:54:10] (03PS2) 10Rush: phab: metamta.maniphest.reply-handler-domain obsolete option [puppet] - 10https://gerrit.wikimedia.org/r/209232 [14:56:29] (03CR) 10Rush: [C: 032] phab: metamta.maniphest.reply-handler-domain obsolete option [puppet] - 10https://gerrit.wikimedia.org/r/209232 (owner: 10Rush) [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, bd808: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T1500). Please do the needful. 
[15:00:15] * bd808 will be SWATTING [15:00:40] (03PS1) 10Rush: phab: storage.upload-size-limit is obsolete [puppet] - 10https://gerrit.wikimedia.org/r/209233 [15:00:51] (03PS2) 10Rush: phab: storage.upload-size-limit is obsolete [puppet] - 10https://gerrit.wikimedia.org/r/209233 [15:00:56] * ^d buckles up [15:01:46] (03CR) 10Rush: [C: 032] phab: storage.upload-size-limit is obsolete [puppet] - 10https://gerrit.wikimedia.org/r/209233 (owner: 10Rush) [15:02:39] (03PS2) 10BryanDavis: Send group0 + group1 MediaWiki events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209170 (https://phabricator.wikimedia.org/T88732) [15:02:46] (03CR) 10BryanDavis: [C: 032] Send group0 + group1 MediaWiki events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209170 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [15:02:51] (03Merged) 10jenkins-bot: Send group0 + group1 MediaWiki events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209170 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [15:04:14] !log bd808 Synchronized wmf-config/InitialiseSettings.php: Send group0 + group1 MediaWiki events to logstash {{gerrit|209170}} (duration: 00m 16s) [15:04:22] Logged the message, Master [15:05:26] (03PS1) 10Rush: Phab now recommends settings opcache.validate_timestamps=0 [puppet] - 10https://gerrit.wikimedia.org/r/209234 [15:05:29] w00t [15:05:37] (03PS2) 10Rush: Phab now recommends settings opcache.validate_timestamps=0 [puppet] - 10https://gerrit.wikimedia.org/r/209234 [15:05:57] log volume into logstash jumped way up and nothing scary looking in hhvm logs [15:06:05] https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki [15:06:54] (03CR) 10Rush: [C: 032] Phab now recommends settings opcache.validate_timestamps=0 [puppet] - 10https://gerrit.wikimedia.org/r/209234 (owner: 10Rush) [15:08:39] ^d: If you would be so kind as to +2 https://gerrit.wikimedia.org/r/#/c/208987/ I'd push it out to beta 
and then prod [15:09:42] <^d> looking [15:09:47] (03PS2) 10BBlack: move geoiplookup to text-addrs-v4 [dns] - 10https://gerrit.wikimedia.org/r/209174 [15:10:19] (03CR) 10BBlack: [C: 032] move geoiplookup to text-addrs-v4 [dns] - 10https://gerrit.wikimedia.org/r/209174 (owner: 10BBlack) [15:10:25] (03CR) 10Chad: [C: 032] Update statsd events [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) (owner: 10BryanDavis) [15:10:33] <^d> bd808: there you go [15:10:38] thx [15:10:41] <^d> yw [15:10:43] 6operations, 6Commons, 6Multimedia, 7HHVM, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1264993 (10Joe) [15:10:48] (03Merged) 10jenkins-bot: Update statsd events [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) (owner: 10BryanDavis) [15:10:50] 6operations, 6Commons, 6Multimedia, 7HHVM, 5Patch-For-Review: Create an HHVM 3.6.0 package, adding Tim's streaming patch - https://phabricator.wikimedia.org/T93194#1264991 (10Joe) 5Open>3Resolved [15:11:39] bd808: \o/ [15:12:21] 6operations, 6Phabricator: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1264999 (10chasemp) 3NEW a:3Springle [15:12:31] mutante: JohnMcLear says to update Etherpad Lite ASAP [15:13:32] marktraceur: again ? how many more different CVEs this time ? [15:13:55] Didn't ask, just reporting [15:14:03] akosiaris: all of them! 
[15:14:11] (03PS1) 10Ottomata: Fix hadoop.pp documentation default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209236 [15:14:13] (03PS1) 10Ottomata: Fix namenode address selection in spark.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209237 [15:14:25] I 've backported some CVE patches to 1.4.something [15:14:29] (03CR) 10Ottomata: [C: 032] Fix hadoop.pp documentation default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209236 (owner: 10Ottomata) [15:14:37] (03CR) 10Ottomata: [C: 032] Fix namenode address selection in spark.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209237 (owner: 10Ottomata) [15:15:24] (03PS1) 10Ottomata: Update cdh module with spark assembly jar path fix [puppet] - 10https://gerrit.wikimedia.org/r/209238 [15:15:37] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with spark assembly jar path fix [puppet] - 10https://gerrit.wikimedia.org/r/209238 (owner: 10Ottomata) [15:16:03] 6operations, 6Phabricator: m3 set max_allowed_packet to 33554432 or greater - https://phabricator.wikimedia.org/T98339#1265041 (10chasemp) p:5Triage>3Normal [15:17:23] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [15:17:33] marktraceur: yeah, he is probably referring to the CVEs in 1.5.3 and 1.5.5. I 've backported them to 1.4.1-2 so we are safe from that [15:17:44] fatalmonitor in logstash has quite a few "Can't connect to MySQL server" errors for a largish number of mysql hosts: 10.64.16.23 10.64.48.15 10.64.48.23 10.64.32.26 10.64.16.28 10.64.16.28 10.64.48.19 10.64.32.29 10.64.48.25 10.64.16.8 10.64.16.33 10.64.16.16 [15:17:51] Oh. [15:18:25] The fialure count on each host is small (<10 each) in the last 5 minutes [15:20:28] 6operations, 6Commons, 6Multimedia, 7HHVM, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1265055 (10Joe) Ok so, after a lot of battles with HHVM 3.6.1, the mysterious 503s on the imagescalers continued. 
One possible cause is the fact that apparently some ima... [15:22:29] <_joe_> bd808: 10.64.16.23 is a labs host? [15:22:47] <_joe_> nope sorry [15:23:06] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1265075 (10MF-Warburg) Apart from the fact that Glaisher and Nemo are not the people who can create the new wiki, it still does not seem w... [15:23:29] the error rate is really low for it to be some major problem, but it is strange to see so many [15:24:12] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:24:14] well... maybe it isn't so low. ~100 errors in the last 5 minutes [15:24:32] <_joe_> bd808: it seems to be s7 [15:24:35] (03PS1) 10Phuedx: Import lists for the Browse experiment on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) [15:24:49] but really diverse for both MW servers and db servers [15:25:03] <_joe_> bd808: they're all s7 AFAICS [15:25:28] !log trebuchet fetch for scap/scap failed on mw1222.eqiad.wmnet [15:25:35] Logged the message, Master [15:25:55] (03CR) 10Phuedx: [C: 04-2] "-2 until we're happy with the rest of the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) (owner: 10Phuedx) [15:27:05] !log trebuchet checkout for scap/scap failed for mw1113.eqiad.wmnet, mw1222.eqiad.wmnet, mw1104.eqiad.wmnet [15:27:11] Logged the message, Master [15:27:35] !log Updated scap to 57036d2 (Update statsd events) [15:27:39] Logged the message, Master [15:29:20] !log Stashed uncommitted change to scap on tin that disabled php opening tag check for sync-file [15:29:30] Logged the message, Master [15:36:32] why would mw1200 be trying to connect to a db server in codfw? [15:36:57] 10.64.48.* is in codfw right? 
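A rough way to answer this kind of question is to bucket the failing hosts by subnet. The sketch below is illustrative only: the host list is copied from the fatalmonitor message above, the /24 grouping is an assumption about how the subnets are carved up, and `KNOWN_SITES` is a hypothetical lookup table holding just the one mapping (10.64.48.0/24 is eqiad) that is stated in a reply a few minutes later in this log.

```python
import ipaddress
from collections import Counter

# Hosts copied from the fatalmonitor message (the duplicate is kept as posted).
FAILING_HOSTS = (
    "10.64.16.23 10.64.48.15 10.64.48.23 10.64.32.26 10.64.16.28 10.64.16.28 "
    "10.64.48.19 10.64.32.29 10.64.48.25 10.64.16.8 10.64.16.33 10.64.16.16"
).split()

# Hypothetical lookup table. Only this single entry is confirmed in channel;
# every other subnet is deliberately reported as "unknown".
KNOWN_SITES = {ipaddress.ip_network("10.64.48.0/24"): "eqiad"}


def subnet_of(host, prefix=24):
    """Return the network containing `host` at the given prefix length."""
    return ipaddress.ip_network(f"{host}/{prefix}", strict=False)


def site_of(host):
    """Return the site label for `host`, or "unknown" if no subnet matches."""
    addr = ipaddress.ip_address(host)
    for net, site in KNOWN_SITES.items():
        if addr in net:
            return site
    return "unknown"


def count_by_subnet(hosts):
    """Count how many reported hosts fall in each /24."""
    return Counter(str(subnet_of(h)) for h in hosts)


if __name__ == "__main__":
    for net, n in count_by_subnet(FAILING_HOSTS).most_common():
        print(f"{net}: {n} host entries")
    print("10.64.48.15 ->", site_of("10.64.48.15"))
```

Grouping by subnet only narrows down where the errors cluster; whether they share a database section (s7 was suggested above) still has to be checked against the db configs.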
[15:37:22] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [15:37:49] oh, nope that db is listed for both eqiad and codfw db configs [15:38:51] no, 10.64.48.0/24 is eqiad [15:40:27] If anybody is curious, here's the logstash report I'm looking at -- https://logstash.wikimedia.org/#dashboard/temp/r9LHSZNUQLuHXYYKvLuGqA [15:41:16] If you zoom out on the timeline you'll see it has an abrupt start but that was just when I turned on group1 MW logs into logstash [15:41:31] (03PS1) 1020after4: Symlink for sprint extension static files. [puppet] - 10https://gerrit.wikimedia.org/r/209243 [15:42:29] (03PS2) 1020after4: Symlink for sprint extension static files. [puppet] - 10https://gerrit.wikimedia.org/r/209243 (https://phabricator.wikimedia.org/T91207) [15:43:24] (03CR) 10Rush: [C: 04-1] Symlink for sprint extension static files. [puppet] - 10https://gerrit.wikimedia.org/r/209243 (https://phabricator.wikimedia.org/T91207) (owner: 1020after4) [15:43:41] 6operations, 10ops-codfw: document network switch stack cables in use - https://phabricator.wikimedia.org/T98344#1265142 (10RobH) 3NEW a:3Papaul [15:46:51] PROBLEM - puppet last run on rhodium is CRITICAL Puppet has 5 failures [15:47:51] (03CR) 10Rush: "I missed this staged change, this is done now" [puppet] - 10https://gerrit.wikimedia.org/r/208848 (owner: 10Negative24) [15:48:03] (03Abandoned) 10Rush: phabricator: Remove obsolete configs [puppet] - 10https://gerrit.wikimedia.org/r/208848 (owner: 10Negative24) [15:49:11] PROBLEM - High load average on labstore1001 is CRITICAL 57.14% of data above the critical threshold [24.0] [15:51:22] PROBLEM - puppet last run on mw1126 is CRITICAL Puppet has 1 failures [15:51:31] PROBLEM - puppet last run on mw1129 is CRITICAL Puppet has 1 failures [15:51:52] RECOVERY - DPKG on rhodium is OK: All packages OK [15:54:51] PROBLEM - puppet last run on mw2048 is CRITICAL Puppet has 1 failures [15:55:17] (03PS3) 1020after4: Symlink for sprint extension static files. 
[puppet] - 10https://gerrit.wikimedia.org/r/209243 (https://phabricator.wikimedia.org/T91207) [15:55:31] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [15:57:32] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1265180 (10BBlack) p:5High>3Normal [16:00:05] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T1600). [16:01:32] (03PS3) 10BryanDavis: gdash: adjust deploy metrics [puppet] - 10https://gerrit.wikimedia.org/r/208085 (https://phabricator.wikimedia.org/T64667) (owner: 10Filippo Giunchedi) [16:01:42] PROBLEM - puppet last run on dbproxy1003 is CRITICAL Puppet has 1 failures [16:02:21] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [16:04:29] (03CR) 10BryanDavis: "Updated naming to make more sense relative to the modern scap tools." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/208085 (https://phabricator.wikimedia.org/T64667) (owner: 10Filippo Giunchedi) [16:07:16] * aude waves [16:07:41] RECOVERY - puppet last run on mw1126 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:07:51] RECOVERY - puppet last run on mw1129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:08:42] man I hate how drag-n-drop in a phab workboard's effects on priority are kinda random, if sorting by priority [16:09:10] seems there's already an upstream bugfix, maybe: https://secure.phabricator.com/rP5aca5299805803916b9d907d617128fc28929695 [16:09:32] we should have pulled that in [16:09:34] I think [16:09:57] (03PS4) 1020after4: Symlink for sprint extension static files. 
[puppet] - 10https://gerrit.wikimedia.org/r/209243 (https://phabricator.wikimedia.org/T91207) [16:10:43] well for instance, the scenario that caused my High>Normal a few lines above: I moved from one workboard slot to the other while Sort:priority in effect. The destination column only had 2x Normal priority. I moved it above them to try to avoid a prio change, but it still grouped up with them and made an unintended change to Normal. [16:11:02] RECOVERY - puppet last run on mw2048 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:11:47] I can try again with T96854, same basic scenario [16:11:49] yeah I hate the implicit priority change idea [16:11:51] (or you can to see it) [16:12:02] it wouldn't be so bad if the effects were predictable [16:12:16] I would guess it is we just don't understand it :) [16:12:49] well, if I move a High prio into a column with no existing High prio's in it, it seems to drop to the next one that there is something of. [16:12:57] I don't see any higher-up drag target of any kind while dragging [16:13:13] I'll try harder with that other one and make sure I didn't miss something with drag behavior [16:14:12] hmmm also, it's taking a while to actually process the drag completion. are we still in semi-outage? [16:14:44] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1265243 (10BBlack) p:5High>3Normal [16:14:57] there it goes, after I gave up and nav'd away [16:14:59] hmmmm [16:15:30] maybe it's trying to wait on some other input from me. but I have no idea what [16:15:56] no outage now [16:16:02] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:16:03] that I know of tho [16:16:41] I'll try another random drag and actually wait forever on the page and see if it eventually completes visibly [16:17:13] (03CR) 10Rush: [C: 04-1] Symlink for sprint extension static files.
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209243 (https://phabricator.wikimedia.org/T91207) (owner: 1020after4) [16:17:20] what I saw just now was it was greyed out in the destination column for a long time. gave up and reload and no change in effect. then eventually the log line above appeared and it had moved. [16:17:41] that's not great [16:17:42] RECOVERY - DPKG on labmon1001 is OK: All packages OK [16:18:03] I'm timing this one now, and not leaving the page until it stops being greyed out [16:18:14] took ~28s, did eventually complete [16:18:21] ok so doing the same here https://phabricator.wikimedia.org/tag/phabricator/ [16:18:26] seems to only take a moment [16:18:30] hm [16:18:35] what board are you on? [16:18:56] https://phabricator.wikimedia.org/project/board/1201/?order=priority [16:19:05] ^ perhaps order:priority affects it too, I can try without [16:19:22] RECOVERY - puppet last run on dbproxy1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:19:22] no, still the same [16:19:30] yeah it's taking forever there for me too [16:19:48] feel free to muck with upnext/inprog columns to test there [16:19:58] the distinction is always blurry in this case anyways :) [16:20:50] uh [16:20:56] that's new, you moved one to Done and it just vanished [16:21:05] idk wtf is gonig on here [16:21:08] it's haunted [16:21:22] oh you moved it back [16:21:44] I did [16:21:52] this may be affected by our lack of aphlict server [16:21:57] which is teh real time stuffs [16:22:02] (03PS5) 1020after4: Symlink for sprint extension static files. [puppet] - 10https://gerrit.wikimedia.org/r/209243 (https://phabricator.wikimedia.org/T91207) [16:22:12] bblack: pm'd you a note on poking at this [16:22:15] just an fyi [16:28:36] isn't the solution for workboard ordering to simply not "sort by priority" ? [16:30:25] if you sort natural, then it shouldn't re-prioritize your stuff when you drag a task, right? 
[16:30:47] correct [16:31:04] I forget to reset to natural sort before I drag a bunch of done tasks [16:31:12] sometimes, and get mad at myself [16:31:35] (03CR) 10Rush: [C: 032] "should work" [puppet] - 10https://gerrit.wikimedia.org/r/209243 (https://phabricator.wikimedia.org/T91207) (owner: 1020after4) [16:31:45] (03CR) 10Dzahn: "the redirect to mediawiki.org looks ok but a redirect to labs would be an issue because of what ori said. (but this patch doesn't do that yet" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [16:32:41] PROBLEM - salt-minion processes on rdb2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:32:41] PROBLEM - salt-minion processes on rdb2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:32:42] PROBLEM - DPKG on rdb2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:32:52] PROBLEM - DPKG on rdb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:33:02] PROBLEM - salt-minion processes on rdb2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:33:02] PROBLEM - salt-minion processes on rdb2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:33:02] PROBLEM - DPKG on rdb2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:33:11] PROBLEM - DPKG on rdb2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:33:51] (03PS1) 1020after4: qualify that command. [puppet] - 10https://gerrit.wikimedia.org/r/209249 [16:34:45] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1265396 (10hashar) The packages are already built for Trusty but I don't think they will work as is on Jessie (ex: different ruby version). The...
[16:34:45] (03CR) 10Rush: [C: 032] qualify that command. [puppet] - 10https://gerrit.wikimedia.org/r/209249 (owner: 1020after4) [16:35:00] it really is non-obvious behavior on phabricator's part... [16:35:52] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:37:32] RECOVERY - DPKG on rdb2002 is OK: All packages OK [16:37:52] RECOVERY - salt-minion processes on rdb2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:38:21] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 6Release-Engineering, and 2 others: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1265437 (10hashar) We talked about this task during our weekly RelEng checkin. T... [16:39:11] RECOVERY - salt-minion processes on rdb2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:39:42] RECOVERY - DPKG on rdb2003 is OK: All packages OK [16:40:02] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1265453 (10Dzahn) >>! In T97866#1261518, @Ottomata wrote: > All clear from me, but it is not clear what services are being asked for. original request had "Sim...
[16:40:52] RECOVERY - salt-minion processes on rdb2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:41:02] RECOVERY - DPKG on rdb2001 is OK: All packages OK [16:41:02] PROBLEM - puppet last run on rdb2002 is CRITICAL Puppet has 1 failures [16:41:21] RECOVERY - salt-minion processes on rdb2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:41:21] RECOVERY - DPKG on rdb2004 is OK: All packages OK [16:41:49] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1265486 (10Ottomata) Sounds good. researchers makes sense. [16:43:29] (03PS3) 10Dzahn: stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren) [16:43:31] !log done with all trusty salt updates in pro except for labcontrol1002 (?), doing jessie now in very tiny batches, it's being trouble [16:43:41] Logged the message, Master [16:45:21] (03PS4) 10Dzahn: stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren) [16:46:33] (03CR) 10Dzahn: [C: 032] stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren) [16:47:26] (03CR) 10Dzahn: "it had it all. approval, waiting period, ack from analytics, signed L3,.." 
[puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren) [16:49:26] expect whines from the ganeti hosts about salt, I'm taking care of it [16:50:32] PROBLEM - DPKG on ganeti2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:50:51] PROBLEM - salt-minion processes on ganeti2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:50:52] PROBLEM - DPKG on ganeti2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:51:11] PROBLEM - salt-minion processes on ganeti2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:51:21] PROBLEM - DPKG on ganeti2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:51:30] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1265542 (10Dzahn) @ottomata thanks @coren rebased and amended your change and merged groups: researchers and bastiononly like for mholloway @niedzielski your... 
[16:51:41] PROBLEM - salt-minion processes on ganeti2006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:51:42] PROBLEM - DPKG on ganeti2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:51:42] PROBLEM - DPKG on ganeti2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:51:43] PROBLEM - salt-minion processes on ganeti2005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:51:43] PROBLEM - DPKG on ganeti2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:51:43] PROBLEM - salt-minion processes on ganeti2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:52:01] PROBLEM - salt-minion processes on ganeti2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:52:12] RECOVERY - puppet last run on rdb2002 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:53:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1265554 (10Dzahn) 5Open>3Resolved a:3Dzahn just like on T95506 , if any issues or questions feel free to reopen [16:54:42] yeah those are salt-related, those broken packages, please ignore [16:54:54] so irritating [16:56:04] (03CR) 10Hashar: [C: 031] "A couple comments for information." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/208024 (owner: 10Dduvall) [16:56:48] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1265563 (10Sniedzielski) I'm in! Thanks!
[16:56:52] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [17:02:11] (03PS1) 10Dzahn: admin: add niedzielski to releasers-mobile [puppet] - 10https://gerrit.wikimedia.org/r/209256 (https://phabricator.wikimedia.org/T98179) [17:03:20] !log hadoop active namenode switched back to analytics1001 after rack C4 switch replacement [17:03:22] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:25] Logged the message, Master [17:03:48] thanks jgage [17:03:53] btw, we got HA RM up and running! [17:04:13] 6operations, 6Labs, 10Labs-Infrastructure, 7Ipv6: Enable ipv6 on labs - https://phabricator.wikimedia.org/T37947#1265579 (10scfc) [17:04:14] awesome! [17:04:51] (03CR) 10Dzahn: [C: 032] "as approved on T97866" [puppet] - 10https://gerrit.wikimedia.org/r/209256 (https://phabricator.wikimedia.org/T98179) (owner: 10Dzahn) [17:05:35] (03PS1) 10Alexandros Kosiaris: puppetmaster: use require_packages [puppet] - 10https://gerrit.wikimedia.org/r/209257 [17:05:37] (03PS1) 10Alexandros Kosiaris: puppetmaster: remove uuid-generator [puppet] - 10https://gerrit.wikimedia.org/r/209258 [17:05:39] (03PS1) 10Alexandros Kosiaris: puppetmaster: Remove the package installed site [puppet] - 10https://gerrit.wikimedia.org/r/209259 [17:05:41] (03PS1) 10Alexandros Kosiaris: puppetmaster: Move system::role to the role class [puppet] - 10https://gerrit.wikimedia.org/r/209260 [17:05:43] (03PS1) 10Alexandros Kosiaris: puppetmaster: Do not manage certmanager's home [puppet] - 10https://gerrit.wikimedia.org/r/209261 [17:05:45] (03PS1) 10Alexandros Kosiaris: puppetmaster: remove legacy resources [puppet] - 10https://gerrit.wikimedia.org/r/209262 [17:05:47] (03PS1) 10Alexandros Kosiaris: puppetmaster: cleanups in gitsync [puppet] - 10https://gerrit.wikimedia.org/r/209263 [17:05:49] (03PS1) 10Alexandros Kosiaris: puppetmaster: remove extraneous empty line [puppet] - 10https://gerrit.wikimedia.org/r/209264 [17:05:51] 
(03PS1) 10Alexandros Kosiaris: puppetmaster::reporter::logstash. Remove the reporter namespace [puppet] - 10https://gerrit.wikimedia.org/r/209265 [17:05:53] (03PS1) 10Alexandros Kosiaris: puppetmaster::config. Minor lints [puppet] - 10https://gerrit.wikimedia.org/r/209266 [17:05:55] (03PS1) 10Alexandros Kosiaris: puppetmaster::config Avoid out of module dependencies [puppet] - 10https://gerrit.wikimedia.org/r/209267 [17:05:57] (03PS1) 10Alexandros Kosiaris: puppetmaster::gitpuppet lint cleanups [puppet] - 10https://gerrit.wikimedia.org/r/209268 [17:05:59] (03PS1) 10Alexandros Kosiaris: puppetmaster: latest to present [puppet] - 10https://gerrit.wikimedia.org/r/209269 [17:06:01] (03PS1) 10Alexandros Kosiaris: puppetmaster::logstash. Avoid out of module dependencies [puppet] - 10https://gerrit.wikimedia.org/r/209270 [17:06:03] (03PS1) 10Alexandros Kosiaris: puppetmaster: Move backups to the role class [puppet] - 10https://gerrit.wikimedia.org/r/209271 [17:06:05] (03PS1) 10Alexandros Kosiaris: puppetmaster: DRY on inclusion of hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/209272 [17:06:22] RECOVERY - Host db1054 is UPING OK - Packet loss = 0%, RTA = 2.92 ms [17:06:23] what about manager approval for https://phabricator.wikimedia.org/T98179, mutante? :/ [17:06:27] the good kind of puppetmaster spam [17:06:38] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1265599 (10akosiaris) I 've reinstalled the box as precise. Puppet has not been so happy however so I am working on fixing things. 
[17:06:59] Krenair: https://phabricator.wikimedia.org/T97866#1260974 [17:07:06] 6operations, 10Mathoid, 6Services, 5Patch-For-Review: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1265602 (10akosiaris) [17:07:40] Krenair: yep, i actually checked the "approved" was after "please also grant him access to caesium release server" [17:07:58] mutante, hmmm. Personally I'd double check that, but OK :) [17:09:26] Krenair: the whole thing is based on "like mholloway", so this one is fine. i do appreciate your thoroughness [17:09:30] 6operations, 10ops-eqiad, 5Patch-For-Review: db1054 MCE errors logged for CPU temperature - https://phabricator.wikimedia.org/T89801#1265616 (10Cmjohnson) Swapped cpu's today to see if error follows cpu or stays re-applied thermal paste. Updated all the f/w on the server while it's out of rotation. Since it... [17:10:47] 6operations, 6Labs, 10Labs-Infrastructure, 7Ipv6: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1265628 (10scfc) [17:10:52] PROBLEM - MySQL Replication Heartbeat on db1045 is CRITICAL: CRIT replication delay 326 seconds [17:11:22] PROBLEM - MySQL Slave Delay on db1045 is CRITICAL: CRIT replication delay 352 seconds [17:12:26] 6operations, 10Mathoid, 6Services, 5Patch-For-Review: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1265646 (10akosiaris) The mathoid module has been updated to use service::node in https://gerrit.wikimedia.org/r/167413. Unfortunately this can not be merged before the mathoi... [17:14:32] manybubbles: i am seeing a bunch of pool-queuefull errors in the logs for elasticsearch [17:14:41] or ^d [17:14:43] 10Ops-Access-Requests, 6operations, 10Beta-Cluster, 5Patch-For-Review: Add niedzielski releasers-mobile in production and deployment-prep in labs - https://phabricator.wikimedia.org/T98179#1265666 (10Dzahn) approval (and original request to do this before that) was on https://phabricator.wikimedia.org/T97...
[17:15:17] i don't know what this means or to do in this case (other than maybe restart? which i can't do) [17:15:28] they come and go [17:16:46] seems okay now but might happen again [17:17:02] RECOVERY - salt-minion processes on ganeti2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:17:03] 10Ops-Access-Requests, 6operations, 10Beta-Cluster, 5Patch-For-Review: Add niedzielski releasers-mobile in production and deployment-prep in labs - https://phabricator.wikimedia.org/T98179#1265673 (10Dzahn) 5Open>3Resolved a:3Dzahn @niedzielski and this one gave you access to "caesium" to be able to... [17:17:21] RECOVERY - DPKG on ganeti2002 is OK: All packages OK [17:17:36] <^d> aude: Most of those are tied to a single user and are mostly ignorable [17:17:37] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1265681 (10coren) Nope, all good. Patch incoming. 
[17:17:51] * ^d is kind of in the middle of rel1_25 stuff [17:18:01] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [17:18:44] 10Ops-Access-Requests, 6operations: Grant yurik access to sca1001 cluster for graphoid debugging/restarts - https://phabricator.wikimedia.org/T98371#1265689 (10akosiaris) 3NEW [17:19:06] (03PS2) 10Alexandros Kosiaris: Assign graphoid-admin to the SCA cluster [puppet] - 10https://gerrit.wikimedia.org/r/208998 (https://phabricator.wikimedia.org/T98371) [17:19:17] (03PS1) 10Aude: Enable usage tracking on fawiki, hewiki and enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209275 (https://phabricator.wikimedia.org/T98237) [17:19:23] ^d: ok [17:21:08] 6operations, 10Traffic, 7Mobile: Replace bits URL in Firefox app, if possible - https://phabricator.wikimedia.org/T98373#1265719 (10BBlack) 3NEW a:3ori [17:21:11] RECOVERY - DPKG on ganeti2001 is OK: All packages OK [17:21:35] (03CR) 10Aude: [C: 032] Enable usage tracking on fawiki, hewiki and enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209275 (https://phabricator.wikimedia.org/T98237) (owner: 10Aude) [17:21:52] RECOVERY - salt-minion processes on ganeti2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:22:51] RECOVERY - salt-minion processes on ganeti2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:22:53] RECOVERY - DPKG on ganeti2003 is OK: All packages OK [17:23:02] RECOVERY - salt-minion processes on ganeti2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:23:11] (03Merged) 10jenkins-bot: Enable usage tracking on fawiki, hewiki and enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209275 (https://phabricator.wikimedia.org/T98237) (owner: 10Aude) [17:24:10] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable Wikibase usage tracking on enwikivoyage, fawiki and hewiki (duration: 00m 18s) 
[17:24:18] Logged the message, Master [17:24:22] RECOVERY - DPKG on ganeti2004 is OK: All packages OK [17:24:23] RECOVERY - salt-minion processes on ganeti2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:24:41] PROBLEM - nova-compute process on virt1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [17:24:52] RECOVERY - DPKG on ganeti2005 is OK: All packages OK [17:25:12] RECOVERY - DPKG on ganeti2006 is OK: All packages OK [17:26:02] RECOVERY - salt-minion processes on ganeti2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:27:55] 6operations: Multiple PHP security issues - https://phabricator.wikimedia.org/T96586#1265775 (10MoritzMuehlenhoff) 5Open>3Resolved [17:28:04] 6operations: Decommission virt1001-1009 - https://phabricator.wikimedia.org/T98376#1265780 (10Andrew) 3NEW a:3Andrew [17:30:06] 6operations, 10Traffic: Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006#1265806 (10BBlack) Just tracking some stuff from irc-conversation: * The past-martianness of 85.15.56.0/22 probably isn't a pragmatic issue and can be ignored. It was only a past-martian due to being unallocated, and was all... [17:33:56] does anyone know what's up with the degraded raids on logstash 1004-6? on each of them, md0 is [UU__]. [17:34:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Switch to a non-trunk build, using abi=1 for our first build. [debs/linux] - 10https://gerrit.wikimedia.org/r/207751 (owner: 10Muehlenhoff) [17:35:13] (03PS1) 10Andrew Bogott: Undefine virt1005 and 1006. [puppet] - 10https://gerrit.wikimedia.org/r/209281 [17:36:37] (03PS3) 10coren: Add neilpquinn-wmf to researchers, statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/208732 (https://phabricator.wikimedia.org/T97746) [17:37:06] (03CR) 10Andrew Bogott: [C: 032] Undefine virt1005 and 1006. 
[puppet] - 10https://gerrit.wikimedia.org/r/209281 (owner: 10Andrew Bogott) [17:37:32] RECOVERY - nova-compute process on virt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [17:37:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] * Amend older changelog entries with security issues fixed in 3.19.x so that we properly keep track [debs/linux] - 10https://gerrit.wikimedia.org/r/207755 (owner: 10Muehlenhoff) [17:38:40] !log depuppeting and decommissioning virt1005 and virt1006 [17:38:47] (03CR) 10Dereckson: [C: 031] "PS2 is totally fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204473 (https://phabricator.wikimedia.org/T93339) (owner: 10Mjbmr) [17:38:47] Logged the message, Master [17:38:59] 10Ops-Access-Requests, 6operations, 10Beta-Cluster, 5Patch-For-Review: Add niedzielski releasers-mobile in production and deployment-prep in labs - https://phabricator.wikimedia.org/T98179#1265848 (10Niedzielski) I'm in! Thanks (and thanks for the proxy note)! [17:40:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.4 (Bug: T97411) [debs/linux] - 10https://gerrit.wikimedia.org/r/208601 (owner: 10Muehlenhoff) [17:40:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.5 (Bug: T97411) [debs/linux] - 10https://gerrit.wikimedia.org/r/208602 (owner: 10Muehlenhoff) [17:40:58] !log powering down virt1005 and virt1006 [17:41:03] Logged the message, Master [17:41:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.6 (Bug: T97411) [debs/linux] - 10https://gerrit.wikimedia.org/r/208662 (owner: 10Muehlenhoff) [17:42:10] <^d> aude: If you could leave any thoughts on https://gerrit.wikimedia.org/r/#/c/208168/, that'd be great. 
[17:42:14] (03Abandoned) 10Muehlenhoff: Update to 3.19.6 (Bug: T97441) [debs/linux] - 10https://gerrit.wikimedia.org/r/208603 (owner: 10Muehlenhoff) [17:42:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] (Bug: T97411) Refresh the control file and change the version scheme; we forked off the last 3.19 Debian upload (3.19.3) and all further upd [debs/linux] - 10https://gerrit.wikimedia.org/r/209181 (owner: 10Muehlenhoff) [17:45:25] 6operations: Decommission virt1001-1009 - https://phabricator.wikimedia.org/T98376#1265862 (10Andrew) virt1005 and virt1006 are now powered down. I've left their entries in DNS for now. [17:49:48] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1265875 (10Gage) [17:51:13] 6operations, 6Analytics-Kanban: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1265890 (10Milimetric) 3NEW [17:51:43] 7Blocked-on-Operations, 6Collaboration-Team, 10Echo, 6Scrum-of-Scrums, 7Schema-change: Perform schema change to echo_target_page changing from a 1 to 1 mapping between pages and user/notification to a 1 to many. 
- https://phabricator.wikimedia.org/T94427#1265899 (10Mattflaschen) [17:54:02] (03CR) 10Dzahn: [C: 031] Add neilpquinn-wmf to researchers, statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/208732 (https://phabricator.wikimedia.org/T97746) (owner: 10coren) [17:54:53] 6operations, 5Interdatacenter-IPsec: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#1265905 (10Gage) [17:55:40] (03CR) 10coren: [C: 032] Add neilpquinn-wmf to researchers, statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/208732 (https://phabricator.wikimedia.org/T97746) (owner: 10coren) [17:56:37] 6operations, 10Traffic, 7Mobile: Replace bits URL in Firefox app, if possible - https://phabricator.wikimedia.org/T98373#1265916 (10brion) Ok found some documentation on changing URLs: https://developer.mozilla.org/en-US/Marketplace/Publishing/Updating_apps#Updating_hosted_apps Sounds like as long as we hav... [17:56:39] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1265917 (10coren) 5Open>3Resolved a:3coren This will apply automagically next puppet run. Welcome. [17:56:58] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1265920 (10Jdforrester-WMF) Thank you! [17:57:05] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1265921 (10MoritzMuehlenhoff) The kernel is now in operations/deb/linux git (currently updated to 3.19.6) and available on apt.wikimedia.org in the jessie-wikimedia suite. I'll add... 
[18:00:03] 6operations: Retire Torrus - https://phabricator.wikimedia.org/T87840#1265942 (10Gage) 5Open>3stalled [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T1800). Please do the needful. [18:00:38] 6operations, 10Wikimedia-Git-or-Gerrit, 7Monitoring: Improve monitoring of https://git.wikimedia.org/ - https://phabricator.wikimedia.org/T94320#1265945 (10Gage) 5Open>3stalled [18:01:24] (03PS2) 10coren: admin: add user for dkg [puppet] - 10https://gerrit.wikimedia.org/r/209155 (https://phabricator.wikimedia.org/T98148) (owner: 10Dzahn) [18:01:52] 6operations, 6Analytics-Kanban: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1265951 (10chasemp) I see this in icinga for graphite1001: > Throughput of event logging events CRITICAL: 92.86% of data above the critical threshold [600.0] [18:04:36] chasemp: chasemp: if a phab ticket has a Patch-For-Review tag and the patch is merged, should the tag go away? it doesn't seem to. [18:04:49] no that's not how the bot works [18:04:57] k [18:06:11] (03PS3) 10coren: admin: add group for traceback-roots [puppet] - 10https://gerrit.wikimedia.org/r/209159 (https://phabricator.wikimedia.org/T98148) (owner: 10Dzahn) [18:08:45] mutante: can you give me a hand with ganglia? I’ve renamed some servers and powered down a couple of others, but ganglia is still reporting stats for the old names and fretting about the down instances. 
[18:09:58] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:11:47] (03CR) 10Dzahn: [C: 031] admin: add user for dkg [puppet] - 10https://gerrit.wikimedia.org/r/209155 (https://phabricator.wikimedia.org/T98148) (owner: 10Dzahn) [18:12:28] (03CR) 10Dzahn: [C: 031] admin: add group for traceback-roots [puppet] - 10https://gerrit.wikimedia.org/r/209159 (https://phabricator.wikimedia.org/T98148) (owner: 10Dzahn) [18:13:41] andrewbogott: i'm afraid i don't know that [18:13:42] andrewbogott: restart gmond / gmetad [18:13:57] ori: on uranium? [18:14:05] whatever is powering ganglia.wikimedia.org, yeah [18:14:06] Or, rather, gmond on client, gemtad on server? [18:14:20] ganglia-web = uranium, yes [18:14:39] (03CR) 10coren: [C: 032] admin: add user for dkg [puppet] - 10https://gerrit.wikimedia.org/r/209155 (https://phabricator.wikimedia.org/T98148) (owner: 10Dzahn) [18:14:41] (03PS1) 10Dereckson: Alphabetical order for groupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209286 [18:14:50] !log restarted gmetad on uranium [18:14:55] (03CR) 10coren: [C: 032] admin: add group for traceback-roots [puppet] - 10https://gerrit.wikimedia.org/r/209159 (https://phabricator.wikimedia.org/T98148) (owner: 10Dzahn) [18:14:57] Logged the message, Master [18:15:52] (03CR) 10Dereckson: "PS1 alphabetical reorder part has been resubmitted as I4874d47950d9cc195d8811b86f18bfe49c819444." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204473 (https://phabricator.wikimedia.org/T93339) (owner: 10Mjbmr) [18:16:20] ori: no dice [18:16:48] andrewbogott: i gotta run, but google 'ganglia dmax' and enjoy the rabbithole [18:17:13] * andrewbogott considers lunch as an alternative [18:17:37] i think you have to restart gmond on every host in that aggegation group [18:17:52] jgage: not just the hosts that changed? 
[18:17:54] quite possibly, yes [18:17:59] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [18:18:06] I mean, some of the hosts that ganglia is tracking don’t exist anymore. [18:18:19] yeah, due to multicast sorcery. i don't really understand it. [18:19:12] and when you’re saying ‘gmond’ you mean ganglia-monitor, right? Since there’s no such service as gmond? [18:19:20] ah, yeah [18:19:33] yay service names != daemon names [18:20:11] (03PS1) 10Dzahn: phabricator: adjust monitoring of TaskmasterDaemon [puppet] - 10https://gerrit.wikimedia.org/r/209287 [18:20:33] 6operations, 10ops-codfw: document network switch stack cables in use - https://phabricator.wikimedia.org/T98344#1266025 (10Papaul) a:5Papaul>3RobH I email you a snap chat of my Cisco simulation to help you see how the switches are connected and the location of each cable. let me know if you have any quest... [18:20:39] it got renamed at some point [18:20:48] so old is gmond and new is ganglia-monitor [18:21:37] hm, nope, restarting ganglia-monitor on virt* and labvirt* and then restarting ganglia-monitr and gmetad on uranium = exact same behavior as before. [18:21:45] (03CR) 10Dzahn: [C: 032] phabricator: adjust monitoring of TaskmasterDaemon [puppet] - 10https://gerrit.wikimedia.org/r/209287 (owner: 10Dzahn) [18:22:28] PROBLEM - HHVM rendering on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [18:22:31] ooh, I missed one [18:22:37] Now it’s updated and wrong in a different way! [18:22:42] * andrewbogott declares victory and goes to eat. 
[18:22:42] yaay [18:23:19] PROBLEM - Apache HTTP on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [18:23:48] PROBLEM - HHVM processes on mw1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:23:59] RECOVERY - check if phabricator taskmaster is running on iridium is OK: PROCS OK: 2 processes with regex args PhabricatorTaskmasterDaemon [18:24:46] (03CR) 10Dereckson: [C: 031] Rename project namespace for tewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204464 (https://phabricator.wikimedia.org/T89332) (owner: 10Mjbmr) [18:24:49] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:25:33] ice cream break before I continue the jessie upgrades [18:25:40] ssiigghh [18:26:00] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 032] "minor comment, otherwise LGTM" (031 comment) [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/207131 (https://phabricator.wikimedia.org/T96655) (owner: 10KartikMistry) [18:26:56] now I get the page. 
nice [18:28:58] chasemp: fixed https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=iridium&service=check+if+phabricator+taskmaster+is+running [18:29:28] RECOVERY - HHVM rendering on mw1081 is OK: HTTP OK: HTTP/1.1 200 OK - 66859 bytes in 6.475 second response time [18:30:10] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [18:30:38] RECOVERY - HHVM processes on mw1081 is OK: PROCS OK: 25 processes with command name hhvm [18:31:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Issue still exists" [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [18:34:33] !log rebooting cp3030 [18:34:46] Logged the message, Master [18:37:53] 6operations: Migrate host lists out of cache.pp to reference values in Hiera - https://phabricator.wikimedia.org/T92601#1266134 (10Gage) `manifests/role/cache.pp` has been refactored into `modules/role/manifests/cache/*` which reference `hieradata/common/cache/*`, hence the redundant data described in this task... 
[18:37:59] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [18:38:49] RECOVERY - MySQL Slave Delay on db1045 is OK replication delay 108 seconds [18:39:16] !log restarting apache on rhodium [18:39:25] Logged the message, Master [18:39:49] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:40:18] RECOVERY - MySQL Replication Heartbeat on db1045 is OK replication delay -0 seconds [18:47:31] (03CR) 10Alexandros Kosiaris: [C: 04-2 V: 04-1] "gbp:error: upstream/0.3.1_r60155 is not a valid treeish" [debs/contenttranslation/apertium-eu-en] - 10https://gerrit.wikimedia.org/r/207031 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [18:48:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] Added initial Debian package for apertium-eu-en [debs/contenttranslation/apertium-eu-en] - 10https://gerrit.wikimedia.org/r/207031 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [18:49:45] (03CR) 10BBlack: [C: 04-1] "+1 in general, but -1 to hold this until cache reboot process is done, because everything related to this is kinda messed up until then..." [puppet] - 10https://gerrit.wikimedia.org/r/209049 (owner: 10Ori.livneh) [18:50:00] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1266171 (10BBlack) [18:50:23] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 032] "Minor issue, otherwise LGTM" (031 comment) [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/206806 (https://phabricator.wikimedia.org/T96654) (owner: 10KartikMistry) [18:50:25] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1266173 (10Dzahn) i saw rhodium on icinga because "puppetmaster backend https" is shown as CRIT and when i looked it appeared like one of the crashes of mod_passenger, bu... 
[18:51:16] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 032] "minor comment inline, LGTM otherwise" (031 comment) [debs/contenttranslation/apertium-es-gl] - 10https://gerrit.wikimedia.org/r/206805 (https://phabricator.wikimedia.org/T96654) (owner: 10KartikMistry) [18:51:21] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1266178 (10Dzahn) So what created the ./rack directory on palladium? [18:52:56] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-tat [debs/contenttranslation/apertium-tat] - 10https://gerrit.wikimedia.org/r/206821 (https://phabricator.wikimedia.org/T95876) (owner: 10KartikMistry) [18:53:29] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging for apertium-eus [debs/contenttranslation/apertium-eus] - 10https://gerrit.wikimedia.org/r/207027 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [18:55:59] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [18:56:48] well that's an interesting non-alert [18:58:18] <^d> 0 unmerged? [19:00:17] maybe it's a race condition in the icinga check [19:00:27] there was one, but it merged in the midst of the script running [19:00:55] well that's critical all right :-D [19:02:20] maybe an empty new dir? ;) [19:02:29] oh, nvm [19:02:39] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [19:05:02] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1266224 (10BBlack) It's running on cp1008 + cp3030 now for testing as well, looks fine so far. 
[19:08:13] aude: sadly, yeah, what d said [19:08:44] manybubbles: ok with me [19:13:33] jouncebot: next [19:13:34] In 0 hour(s) and 46 minute(s): Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T2000) [19:16:49] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 1 failures [19:23:25] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Fix ipv6 autoconf issues - https://phabricator.wikimedia.org/T94417#1266298 (10BBlack) [19:30:03] ori: The project ‘quality-assurance’ has three instances. Two of them haven’t been running for a year, one of them (you created about a year ago. [19:30:24] Shall I delete the two stopped instances, or all of them and the project, or something else? [19:35:46] (03CR) 10Dzahn: [C: 04-1] "in waiting period until May 8 per ticket" [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn) [19:36:29] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60617 bytes in 0.489 second response time [19:36:46] (03CR) 10Dzahn: [C: 04-1] "needs SSH key from andre__ who is currently on vacation" [puppet] - 10https://gerrit.wikimedia.org/r/208802 (https://phabricator.wikimedia.org/T97642) (owner: 10Dzahn) [19:37:52] (03PS1) 1020after4: Remove 1.25wmf23 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209305 [19:37:54] (03PS1) 1020after4: Add 1.26wmf5 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209306 [19:37:56] (03PS1) 1020after4: Wikipedias to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209307 [19:37:58] (03PS1) 1020after4: Group0 to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209308 [19:38:08] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[19:38:26] (03CR) 1020after4: [C: 032] Remove 1.25wmf23 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209305 (owner: 1020after4) [19:38:32] (03Merged) 10jenkins-bot: Remove 1.25wmf23 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209305 (owner: 1020after4) [19:38:34] (03CR) 1020after4: [C: 032] Add 1.26wmf5 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209306 (owner: 1020after4) [19:39:44] (03Merged) 10jenkins-bot: Add 1.26wmf5 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209306 (owner: 1020after4) [19:39:46] (03CR) 10Dzahn: "@jdlrobson the evening SWAT today is still free and has nothing else to do so far: https://wikitech.wikimedia.org/wiki/Deployments#Week_of" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [19:41:48] greg-g: yt? mobilefrontend is currently melting eventlogging and we'd like to get a patch deployed pdq [19:41:59] phuedx: yes plesae [19:42:13] phuedx: twentyafterfour is doing the train deploy right now, best to coordinate with him [19:44:20] !log twentyafterfour Started scap: testwiki to php-1.26wmf5 and rebuild l10n cache [19:44:27] Logged the message, Master [19:44:28] twentyafterfour: what do you need me to do? the fix (https://gerrit.wikimedia.org/r/#/c/209296/) has been merged to master [19:45:10] phuedx: which branch does that need to be in? just wmf5 or also wmf4 [19:45:22] I assume both [19:45:28] both i think [19:45:36] ok I'll deploy it [19:45:40] wmf4 is going out to wikipedias now? 
[19:46:13] Uh, upd https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap [19:46:28] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [19:46:52] thanks twentyafterfour [19:47:16] i'll be idling (i'm about to start bottle feeding my son) [19:47:25] (03CR) 10BBlack: [C: 04-1] IPsec: Icinga monitor for Strongswan connections (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199787 (owner: 10Gage) [19:50:25] phuedx: yes it's going out to wikipedias very shortly [19:50:31] ACKNOWLEDGEMENT - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 21 unmerged changes in puppet (dir /var/lib/git/operations/puppet). alexandros kosiaris still being onlined. Ignore [19:50:31] ACKNOWLEDGEMENT - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error alexandros kosiaris still being onlined. Ignore [19:50:44] i'm very grateful twentyafterfour [19:51:09] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60618 bytes in 0.797 second response time [19:54:38] (03CR) 10BBlack: "Should this be using some kind of decom role to kill the processes/things on the hosts?" 
[puppet] - 10https://gerrit.wikimedia.org/r/209188 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi) [19:55:28] (03CR) 10BBlack: [C: 031] "Nevermind, I missed the s/statsd/statsite/ part on the decom line :)" [puppet] - 10https://gerrit.wikimedia.org/r/209188 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi) [19:56:06] (03PS1) 10Dzahn: update.php: do not update timestamp on errors [debs/wikistats] - 10https://gerrit.wikimedia.org/r/209321 (https://phabricator.wikimedia.org/T46145) [19:58:45] (03CR) 10Dzahn: [C: 032] update.php: do not update timestamp on errors [debs/wikistats] - 10https://gerrit.wikimedia.org/r/209321 (https://phabricator.wikimedia.org/T46145) (owner: 10Dzahn) [19:58:53] (03CR) 10Dzahn: [V: 032] update.php: do not update timestamp on errors [debs/wikistats] - 10https://gerrit.wikimedia.org/r/209321 (https://phabricator.wikimedia.org/T46145) (owner: 10Dzahn) [20:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T2000). Please do the needful. 
[20:00:33] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 04-1] "gbp:error: upstream/0.3.3_r56159 is not a valid treeish" [debs/contenttranslation/apertium-eu-es] - 10https://gerrit.wikimedia.org/r/207038 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [20:02:41] (03CR) 10BBlack: [C: 04-1] "I think we should replace the shellscript with python + standard modules for this stuff like argparse and subprocess, at least (or even be" [puppet] - 10https://gerrit.wikimedia.org/r/208192 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [20:03:42] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-en-gl [debs/contenttranslation/apertium-en-gl] - 10https://gerrit.wikimedia.org/r/206803 (https://phabricator.wikimedia.org/T96654) (owner: 10KartikMistry) [20:04:14] (03CR) 10Hashar: "We have the Debian package python-statsd which is the pypi module statsd: https://pypi.python.org/pypi/statsd" [puppet] - 10https://gerrit.wikimedia.org/r/208192 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [20:05:50] (03CR) 10Ottomata: "Ok, will look into that then. I'm trying to keep things as simple as possible here, but I am for python (and maybe docopt instead of argp" [puppet] - 10https://gerrit.wikimedia.org/r/208192 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [20:07:58] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 04-1] "gbp:error: upstream/1.1.0_r60158 is not a valid treeish" [debs/contenttranslation/apertium-es-ast] - 10https://gerrit.wikimedia.org/r/207046 (https://phabricator.wikimedia.org/T96652) (owner: 10KartikMistry) [20:09:36] twentyafterfour: did it go out? has anything gone out? 
[20:09:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging for apertium-es-an [debs/contenttranslation/apertium-es-an] - 10https://gerrit.wikimedia.org/r/207045 (https://phabricator.wikimedia.org/T96651) (owner: 10KartikMistry) [20:09:50] * phuedx has probably missed it [20:09:51] phuedx: still waiting for scap [20:09:54] ah [20:09:56] ta :) [20:10:08] which always seems to hang on snapshot1004.eqiad.wmnet [20:12:09] !log twentyafterfour scap failed: OSError [Errno 2] No such file or directory: '/var/lock/scap' (duration: 27m 49s) [20:12:16] Logged the message, Master [20:12:59] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-oc-ca (031 comment) [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/207130 (https://phabricator.wikimedia.org/T96655) (owner: 10KartikMistry) [20:13:16] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 04-1] Added initial Debian package for apertium-oc-ca [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/207130 (https://phabricator.wikimedia.org/T96655) (owner: 10KartikMistry) [20:13:53] wow that's a weird scap error... [20:13:57] (03CR) 10Ori.livneh: "Maybe we should just always print debug messages? It is a debugging tool, after all." [puppet] - 10https://gerrit.wikimedia.org/r/204155 (owner: 10Alexandros Kosiaris) [20:14:03] !log twentyafterfour Synchronized php-1.26wmf4/extensions/MobileFrontend/javascripts/modules/search/init.js: Temporarily disable MobileWebSearch logging (duration: 00m 37s) [20:14:06] (03CR) 10Alexandros Kosiaris: [V: 032] "Sigh, too many wrong clicks. 
Package builds well, minor comment inline, otherwise LGTM" [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/207130 (https://phabricator.wikimedia.org/T96655) (owner: 10KartikMistry) [20:14:09] Logged the message, Master [20:14:26] (03PS2) 10coren: Add tbayer to researchers [puppet] - 10https://gerrit.wikimedia.org/r/209131 (https://phabricator.wikimedia.org/T97916) [20:14:40] !log ignore all rumors of scap failures, the scaps were successful, with the exception of snapshot1004.eqiad.wmnet which hangs every time [20:14:45] Logged the message, Master [20:14:51] ^ lol [20:15:27] phuedx: should be good now? [20:15:36] !log twentyafterfour Synchronized php-1.26wmf5/extensions/MobileFrontend/javascripts/modules/search/init.js: Temporarily disable MobileWebSearch logging (duration: 00m 36s) [20:15:41] twentyafterfour: They were greatly exagerated? :-) [20:15:43] Logged the message, Master [20:15:50] snapshot1004 is very sick. swap death when I was looking last night. >6G of swap in use by 2 PHP processes [20:15:51] twentyafterfour: verifying [20:16:00] (03CR) 10coren: [C: 032] "Simple group addition." [puppet] - 10https://gerrit.wikimedia.org/r/209131 (https://phabricator.wikimedia.org/T97916) (owner: 10coren) [20:18:19] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 032] "Minor comment inline (same as Moritz raised). Otherwise, and since it is stated that we are fine with the software failing the tests, LGTM" (031 comment) [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [20:18:42] (03CR) 1020after4: [C: 032] Wikipedias to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209307 (owner: 1020after4) [20:20:45] twentyafterfour: looks good on test [20:20:59] i guess i'm waiting for that config change, right? 
^^ [20:21:18] yeah that'll bump things to wmf4 branch [20:21:28] ...zuul is gonna take a bit [20:22:04] twentyafterfour: snapshot1004 is having swap issues, I will be looking at that tomorrow (still finishing up salt upgrade on prod today) [20:22:16] ZUUL [20:23:09] apergos: thanks. no rush on my account, I just wanted to make it clear in the SAL that it was just the one server failing, not all of them [20:23:13] right [20:27:11] apergos: do you have a ticket to track that? (snapshot1004) [20:27:39] I don't remember if there is one open (I got two report about it today) [20:28:10] * greg-g nods [20:28:53] ah at last... [20:29:45] !log salt upgraded to 2014.7.5 on all precise/trusty/jessie hosts in production except for: labcontrol2001, tin, virt1000 (deferred) and dysprosium/labvirt1005/labstore1002 (down) [20:29:51] Logged the message, Master [20:30:00] done for the day [20:30:24] g'night! [20:31:36] (03Merged) 10jenkins-bot: Wikipedias to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209307 (owner: 1020after4) [20:32:20] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add Tilman to "researchers" group on stat1003 - https://phabricator.wikimedia.org/T97916#1266634 (10Tbayer) 5Open>3Resolved a:3Tbayer Done now. Thanks, Coren! [20:32:42] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf4 [20:32:42] phuedx: should be good now? 
[20:32:47] Logged the message, Master [20:33:10] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add Tilman to "researchers" group on stat1003 - https://phabricator.wikimedia.org/T97916#1266638 (10Tbayer) a:5Tbayer>3coren [20:33:11] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#1266640 (10Dzahn) a:5Dzahn>3None TLDR: i fixed it on doc.wikimedia.org but it should still be fixed on integration.wikimedia.org [20:33:19] (03CR) 1020after4: [C: 032] Group0 to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209308 (owner: 1020after4) [20:33:37] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1266646 (10Dzahn) a:5Dzahn>3None [20:34:17] twentyafterfour: checking [20:36:07] (03CR) 1020after4: [V: 032] Group0 to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209308 (owner: 1020after4) [20:36:29] twentyafterfour: no longer seeing huge amounts of events being logged, thanks :) [20:36:59] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf5 [20:37:03] phuedx: sweet, glad to help [20:37:08] Logged the message, Master [20:39:50] !log twentyafterfour Purged l10n cache for 1.26wmf3 [20:39:54] Logged the message, Master [20:44:28] PROBLEM - puppet last run on snapshot1004 is CRITICAL puppet fail [21:03:15] (03PS1) 10BBlack: move all US/Canada to eqiad (except Calif) [dns] - 10https://gerrit.wikimedia.org/r/209377 [21:03:39] RECOVERY - puppet last run on snapshot1004 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:03:49] bblack: we can leave a few more [21:03:57] like Hawaii, for instance :) [21:04:15] any others? [21:04:23] portland, washington, maybe? [21:04:25] Alaska? 
[21:04:39] heh [21:04:40] well, we do want to get some things moved other than Montana :P [21:04:46] heh [21:05:06] IMHO, at least in the ConUS, the latency should be trivial [21:05:06] something up with ulsfo? [21:05:24] of us three, I'm guessing you're the most knowledgable on US geography & population estimates :) [21:05:24] I can see ak/hi argument though [21:05:45] maybe, but I don't care as much either, vs packet loss and asians [21:07:41] (03PS2) 10BBlack: all US/Canada to eqiad (except US:CA/AK/HI) [dns] - 10https://gerrit.wikimedia.org/r/209377 [21:08:00] (03CR) 10Faidon Liambotis: [C: 031] all US/Canada to eqiad (except US:CA/AK/HI) [dns] - 10https://gerrit.wikimedia.org/r/209377 (owner: 10BBlack) [21:08:26] Interesting [21:08:50] what is? [21:08:54] (03CR) 10BBlack: [C: 032] all US/Canada to eqiad (except US:CA/AK/HI) [dns] - 10https://gerrit.wikimedia.org/r/209377 (owner: 10BBlack) [21:09:04] Oh just reading backscroll [21:09:15] That Oregon, Nevada etc go to eqiad now [21:09:23] it's a bandaid [21:09:30] Capacity issues? [21:09:42] we're having problems with network link providers with single digit IQs [21:09:55] lol [21:16:29] PROBLEM - puppet last run on labvirt1008 is CRITICAL Puppet has 1 failures [21:19:49] RECOVERY - puppet last run on labvirt1008 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:22:55] that was me [21:23:08] a few straggler hosts with salt keys not accepted, so I got them too [21:26:53] apergos: upgrade happening today? 
[21:27:31] done on all prod except tin (deployment server) and two hosts which are salt master for labs so I want to do them tomorrow with rest of labs [21:27:47] apergos: \o/ [21:27:49] sweet [21:27:57] before you ask the synic packages are already in the repo :-P [21:28:03] *syndic [21:28:06] also the api packages [21:28:19] :D [21:28:20] nice [21:28:48] well, I’m going to allocate some time to it next week, and if it doesn’t pass muster on simple test (‘run hostname on all X hosts’) i’m going to go say fuck it and setup a dsh setup for tools [21:29:12] anyone wanting to make sure their labs instances get upgraded smoothly could make sure that there are no broken packages (apt-get install xyz won't whine) [21:29:20] by tomorrow :-P [21:30:55] doing wildcard matches is a bit more solid because you can ask for the list of hosts which did not respond to be returned [21:31:12] then you can find out if they are swapping, down, minion is dead or whatever [21:31:22] grain based selection doesn't let you do that obviously [21:32:39] all right I was going to go to bed a while ago and didn't [21:32:41] so. gone! 
[21:35:35] (03PS1) 10Dzahn: fix sourceforge wikis with different URL scheme [debs/wikistats] - 10https://gerrit.wikimedia.org/r/209384 (https://phabricator.wikimedia.org/T97834) [21:42:45] (03PS2) 10Dzahn: fix sourceforge wikis with different URL scheme [debs/wikistats] - 10https://gerrit.wikimedia.org/r/209384 (https://phabricator.wikimedia.org/T97834) [21:43:41] (03CR) 10Dzahn: [C: 032 V: 032] fix sourceforge wikis with different URL scheme [debs/wikistats] - 10https://gerrit.wikimedia.org/r/209384 (https://phabricator.wikimedia.org/T97834) (owner: 10Dzahn) [22:07:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [22:08:18] PROBLEM - puppet last run on cp3040 is CRITICAL puppet fail [22:10:40] (03PS1) 10Dzahn: ganglia: switch PDF cluster to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/209388 (https://phabricator.wikimedia.org/T93776) [22:14:20] (03PS1) 10Dzahn: ganglia: switch ocg servers to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/209389 (https://phabricator.wikimedia.org/T93776) [22:16:03] (03CR) 10Ori.livneh: "@springle: cool -- will you follow-up with the results of your investigation?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/175007 (owner: 10Ori.livneh) [22:16:07] (03PS2) 10Dzahn: ganglia: switch ocg servers to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/209389 (https://phabricator.wikimedia.org/T93776) [22:19:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:19:43] 7Puppet, 6Phabricator: Puppet lock files fail because tag names are treated like dirs - https://phabricator.wikimedia.org/T98411#1267229 (10Negative24) 3NEW [22:20:04] (03CR) 10Dzahn: [C: 032] ganglia: switch ocg servers to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/209389 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn) [22:20:58] (03CR) 10Dzahn: [C: 032] ganglia: switch PDF cluster to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/209388 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn) [22:24:19] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [22:31:15] (03PS1) 10Dzahn: ganglia: set ocg1001 as aggregator for ocg hosts [puppet] - 10https://gerrit.wikimedia.org/r/209391 [22:32:43] (03CR) 10Dzahn: [C: 032] ganglia: set ocg1001 as aggregator for ocg hosts [puppet] - 10https://gerrit.wikimedia.org/r/209391 (owner: 10Dzahn) [22:36:54] (03CR) 10Mjbmr: [C: 031] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209286 (owner: 10Dereckson) [22:56:49] 6operations, 10Wikimedia-Mailing-lists: Create an alias for mailman list - https://phabricator.wikimedia.org/T98415#1267357 (10Krenair) [23:00:04] RoanKattouw, ^d, Mjbmr: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150506T2300). Please do the needful. 
[23:02:09] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1267369 (10Mjbmr) @MF-Warburg, I know they are not the people who can create the new wiki. [23:03:27] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1267370 (10Krenair) Well then why are you saying that they're waiting? [23:04:00] RoanKattouw: ^demon|busy Mjbmr is doing SWAT now? [23:04:11] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1267371 (10Mjbmr) They know, and you not them. [23:04:40] yuvipanda: I'll do it in a minute [23:04:57] haha [23:05:13] RoanKattouw: no, i meant - jouncebot announced Mjbmr too [23:05:32] it lists the name of one of the developers listed [23:05:40] on the page [23:05:45] it's a bit broken [23:05:47] ah [23:06:25] RoanKattouw, kaldari and I are making submodule updates. [23:07:37] rmoen: Hi, could you submit a update for UniversalLanguageSelector please. [23:08:45] Mjbmr: not sure what you mean. [23:08:54] nvm. 
[23:09:29] Mjbmr: I'll do your config changes first if that's OK
[23:09:32] While rmoen works on submodule updates
[23:09:37] alright
[23:09:57] (CR) Catrope: [C: 2] Add autopatrolled, patroller and rollbacker user groups for svwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/204473 (https://phabricator.wikimedia.org/T93339) (owner: Mjbmr)
[23:10:12] (Merged) jenkins-bot: Add autopatrolled, patroller and rollbacker user groups for svwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/204473 (https://phabricator.wikimedia.org/T93339) (owner: Mjbmr)
[23:10:24] (CR) Catrope: [C: 2] Enabled ShortUrl on kn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/206857 (https://phabricator.wikimedia.org/T97218) (owner: Dereckson)
[23:10:31] (Merged) jenkins-bot: Enabled ShortUrl on kn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/206857 (https://phabricator.wikimedia.org/T97218) (owner: Dereckson)
[23:11:56] Hi.
[23:12:31] !log Created shorturls table on knwiki
[23:12:39] RoanKattouw: er... 206857 is a ShortURL deployment needing to run this script on every row of the page table we spoke about the last time.
[23:12:42] Logged the message, Mr. Obvious
[23:13:05] Dereckson: Oh I don't remember that conversation
[23:13:11] Sorry :S
[23:13:16] Dereckson: Could you remind me?
[23:13:58] (CR) Catrope: [C: 2] Rename project namespace for tewikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/204464 (https://phabricator.wikimedia.org/T89332) (owner: Mjbmr)
[23:14:16] Dereckson: populateShortUrlTable.php ?
[23:14:34] RoanKattouw: yes, https://github.com/wikimedia/mediawiki-extensions-ShortUrl/blob/master/populateShortUrlTable.php#L32
[23:14:35] (Merged) jenkins-bot: Rename project namespace for tewikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/204464 (https://phabricator.wikimedia.org/T89332) (owner: Mjbmr)
[23:14:47] OK
[23:14:56] And I need to run namespaceDupes for the tewiki thing too
[23:15:51] !log catrope Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 17s)
[23:15:52] *tewikiquote
[23:15:56] Logged the message, Master
[23:16:00] !log Running namespaceDupes.php on tewikiquote
[23:16:03] RoanKattouw: MobileFrontend submodule update for wmf5 is ready now: https://gerrit.wikimedia.org/r/#/c/209393/. Feel free to merge it.
[23:16:05] Logged the message, Mr. Obvious
[23:16:13] 16 601 rows, according to [[Special:Statistics]], should be reasonable as duration.
[23:16:21] 204473 confirmed.
[23:16:30] (if it only processes (main); if not, 61,189
[23:16:31] )
[23:16:43] id=4777 ns=0 dbk=వికీవ్యాఖ్య:రచ్చబండ *** dest title exists and --add-prefix not specified
[23:16:45] 1 pages to fix, 0 were resolvable.
[23:17:02] (yes, 61 189 so)
[23:17:31] id=4777 ns=0 dbk=వికీవ్యాఖ్య:రచ్చబండ -> వికీవ్యాఖ్య:Broken/రచ్చబండ (alternate)
[23:17:33] 1 pages to fix, 1 were resolvable.
[23:17:42] that's a redirect from new title in main namespace to old title in project namespace.
[23:17:52] Aha OK
[23:18:05] Mjbmr: It's now at https://te.wikiquote.org/w/index.php?title=%E0%B0%B5%E0%B0%BF%E0%B0%95%E0%B1%80%E0%B0%B5%E0%B1%8D%E0%B0%AF%E0%B0%BE%E0%B0%96%E0%B1%8D%E0%B0%AF:Broken/%E0%B0%B0%E0%B0%9A%E0%B1%8D%E0%B0%9A%E0%B0%AC%E0%B0%82%E0%B0%A1&redirect=no , do with it what you will
[23:18:10] (edit, rename, delete, whatever)
[23:18:26] alright
[23:19:45] !log Running populateShortUrlTable.php on knwiki
[23:19:50] Logged the message, Mr.
Obvious
[23:19:58] Well that was fast, it's already done
[23:20:08] 60800 titles done
[23:20:10] 60885 titles done
[23:20:11] Done
[23:21:11] RoanKattouw: Thank you
[23:21:36] RoanKattouw: good news it's so fast.
[23:22:08] Yeah
[23:22:24] I ran the script, then went "oh I should log that", typed the log message, went back to my shell, and it was already done
[23:30:25] kaldari, rmoen: Does https://gerrit.wikimedia.org/r/#/c/209393/ contain everything you guys need?
[23:30:40] Or are you still working on one for wmf4?
[23:30:46] for wmf5 RoanKattouw.
[23:30:51] RoanKattouw: Yes, it only has 1 update, which is all we need for wmf5
[23:30:53] But yes waiting for jenkins
[23:30:58] OK
[23:31:06] RoanKattouw: We are almost done with the submodule update for wmf4
[23:31:14] Cool
[23:31:17] I'm still doing one for VE anyway
[23:31:25] RoanKattouw: waiting on zuul
[23:38:59] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/209400/
[23:39:06] has all the wmf4 changes
[23:39:20] (CR) Bmansurov: [C: 1] "Rest of the task for the interested: 209224." [mediawiki-config] - https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) (owner: Phuedx)
[23:39:35] (CR) Bmansurov: "https://gerrit.wikimedia.org/r/#/c/209224/" [mediawiki-config] - https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) (owner: Phuedx)
[23:41:02] rmoen: Thanks, merging that one now
[23:41:23] I'll deploy the wmf5 ones now, and the wmf4 ones once Zuul is done with that one
[23:42:57] RoanKattouw: sounds good
[23:43:19] !log catrope Synchronized php-1.26wmf5/extensions/MobileFrontend: SWAT (duration: 00m 34s)
[23:43:28] Logged the message, Master
[23:43:38] !log catrope Synchronized php-1.26wmf5/extensions/VisualEditor: SWAT (duration: 00m 18s)
[23:43:43] Logged the message, Master
[23:48:12] RoanKattouw: wmf5 looks good
[23:48:42] +1