[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170203T0000).
[00:00:19] nothing to deploy
[00:01:10] shoot did i add in the wrong place again..?
[00:01:42] feck. but it in yesterdays didn't i. Damn timezone
[00:02:04] MaxSem: ^
[00:02:07] hi jdlrobson, I can swat your change if you wish, I'm going to add one config change too
[00:02:18] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2995148 (Paladox) Open>Resolved a:Paladox yep, this works now.
[00:02:24] yes please.. am fixing the edit
[00:02:29] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2995151 (Paladox) a:Paladox>None
[00:02:39] two actually, * [config] {{Gerrit|335478}} Set site name for ku.wiktionary ({{phabT|29878}}) + * [config] {{Gerrit|334240}} Interwiki map update ({{phabT|156334}})
[00:02:55] (billinghurst asked for the second one this afternoon)
[00:03:26] kaldari: too
[00:03:45] I just snuck one in at the last second :)
[00:03:47] jdlrobson: yours is {{Gerrit|335732}} Setting wgPageAssessmentsSubprojects to true on testwiki ?
[00:03:52] https://gerrit.wikimedia.org/r/#/c/335732/
[00:03:54] Dereckson: yes
[00:04:00] Dereckson: wait let me fix the wiki page
[00:04:03] just got an edit conflict
[00:04:03] (PS10) Paladox: phabricator: allow to change elasticsearch configs [puppet] - https://gerrit.wikimedia.org/r/335703 (https://phabricator.wikimedia.org/T138881)
[00:04:08] (PS6) Dzahn: ldap: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334291 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[00:04:35] Dereckson, jdlrobson: Yeah, I put mine under the wrong day at first as well.
[00:04:50] 02-02
[00:04:54] confusing
[00:05:21] Dereckson: if you refresh it's up to date now
[00:05:29] ok
[00:05:35] I sympathize, the table isn't convenient to check it's the right date
[00:05:48] Note that https://gerrit.wikimedia.org/r/#/c/335687/ is a labs only change
[00:05:50] so should be easier
[00:06:28] kaldari: already tested on beta?
[00:06:52] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/5324/" [puppet] - https://gerrit.wikimedia.org/r/334291 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[00:07:01] (PS2) Dereckson: Limit page images on beta cluster to images in the lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/335687 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson)
[00:07:11] Dereckson: No, not tested on beta.
[00:07:28] kaldari: it's a new feature?
[00:07:53] Dereckson: Yes, but minor
[00:07:55] jdlrobson: you introduce a new *wmg*?
[00:07:59] kaldari: no new feature in SWAT
[00:08:06] Dereckson: OK
[00:08:36] Dereckson: I'll test on beta first and reschedule
[00:08:37] jdlrobson: it's not wgPageImagesLeadSectionOnly?
[00:08:40] kaldari: good idea
[00:08:46] (PS11) Dzahn: phabricator: allow to change elasticsearch configs [puppet] - https://gerrit.wikimedia.org/r/335703 (https://phabricator.wikimedia.org/T138881) (owner: Paladox)
[00:09:05] kaldari: if you prepare a change for beta now, I can merge it by the way
[00:09:05] Dereckson: ohh shoot
[00:09:16] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:09:24] Dereckson: No can't do it now, will have to wait until next week.
[00:09:29] jdlrobson: here: https://gerrit.wikimedia.org/r/#/c/335687/2/wmf-config/InitialiseSettings-labs.php
[00:09:31] thanks though!
[00:09:32] kaldari: ok
[00:09:36] You're welcome
[00:09:51] (PS3) Jdlrobson: Limit page images on beta cluster to images in the lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/335687 (https://phabricator.wikimedia.org/T152115)
[00:09:52] Dereckson: fixed
[00:09:58] (PS2) Dereckson: Interwiki map update [mediawiki-config] - https://gerrit.wikimedia.org/r/334240 (https://phabricator.wikimedia.org/T156334) (owner: Alex Monk)
[00:10:35] (PS2) Jdlrobson: Related pages is shown to 90% of mobile users [mediawiki-config] - https://gerrit.wikimedia.org/r/335686 (https://phabricator.wikimedia.org/T154681)
[00:10:39] I fixed the other one with the same problem too ^
[00:10:42] jdlrobson: ok I do the Kren.air interwiki update map first, then I merge yours
[00:11:15] (CR) Dzahn: [C: 2] phabricator: allow to change elasticsearch configs [puppet] - https://gerrit.wikimedia.org/r/335703 (https://phabricator.wikimedia.org/T138881) (owner: Paladox)
[00:11:24] mutante ^^ thanks :)
[00:12:30] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/334240 (https://phabricator.wikimedia.org/T156334) (owner: Alex Monk)
[00:12:36] paladox: change in labs, no change in prod
[00:12:38] I've checked, map is still up to date
[00:12:44] paladox: which seems like it should
[00:12:47] Ok thanks :)
[00:13:01] Hi mutante. Hey, you come to FOSDEM this year?
[00:13:14] Dereckson: Hello, unfortunately not. no
[00:13:26] ok
[00:14:14] (Merged) jenkins-bot: Interwiki map update [mediawiki-config] - https://gerrit.wikimedia.org/r/334240 (https://phabricator.wikimedia.org/T156334) (owner: Alex Monk)
[00:14:26] (CR) jenkins-bot: Interwiki map update [mediawiki-config] - https://gerrit.wikimedia.org/r/334240 (https://phabricator.wikimedia.org/T156334) (owner: Alex Monk)
[00:14:45] (PS2) Dzahn: zuul: Linting changes [puppet] - https://gerrit.wikimedia.org/r/335418 (owner: Juniorsys)
[00:14:50] (CR) Dzahn: [C: 2] zuul: Linting changes [puppet] - https://gerrit.wikimedia.org/r/335418 (owner: Juniorsys)
[00:15:07] twentyafterfour: still some legacy l10n cache folders to delete for the train by the way
[00:15:36] twentyafterfour: try a scap pull on mwdebug1002 you'll see the issue
[00:16:34] (PS2) Dzahn: xdummy: Linting changes [puppet] - https://gerrit.wikimedia.org/r/335422 (owner: Juniorsys)
[00:17:37] !log dereckson@tin Synchronized wmf-config/interwiki.php: Interwiki map update (duration: 00m 40s)
[00:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:58] (PS1) Dzahn: switch apt.wm.org from carbon to install1002 [dns] - https://gerrit.wikimedia.org/r/335734 (https://phabricator.wikimedia.org/T132757)
[00:19:01] (CR) Dzahn: [C: 2] xdummy: Linting changes [puppet] - https://gerrit.wikimedia.org/r/335422 (owner: Juniorsys)
[00:19:40] (PS4) Dereckson: Limit page images on beta cluster to images in the lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/335687 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson)
[00:19:46] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/335687 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson)
[00:20:01] Dereckson: i should have also let you know that i must go no later than 5.05pm PST - hope that's enough time
[00:20:35] should be okay
[00:21:21] (Merged) jenkins-bot: Limit page images on beta cluster to images in the lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/335687 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson)
[00:21:33] (CR) jenkins-bot: Limit page images on beta cluster to images in the lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/335687 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson)
[00:24:24] Dereckson: lemme know when testing can occur
[00:24:35] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Limit page images on beta cluster to images in the lead section (no-op in prod) (duration: 00m 41s)
[00:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:08] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2995275 (Dzahn)
[00:25:34] (PS2) Dereckson: Update apple touch icon for Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/335689 (https://phabricator.wikimedia.org/T152538) (owner: Jdlrobson)
[00:25:36] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/5325/" [puppet] - https://gerrit.wikimedia.org/r/334311 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[00:25:44] (PS6) Dzahn: redis: Linting changes [puppet] - https://gerrit.wikimedia.org/r/334311 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[00:26:21] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/335689 (https://phabricator.wikimedia.org/T152538) (owner: Jdlrobson)
[00:26:40] (PS2) Dereckson: RelatedArticles enabled on French Wikipedia (mobile only) [mediawiki-config] - https://gerrit.wikimedia.org/r/335685 (https://phabricator.wikimedia.org/T156362) (owner: Jdlrobson)
[00:26:56] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/335685 (https://phabricator.wikimedia.org/T156362) (owner: Jdlrobson)
[00:28:04] (Merged) jenkins-bot: Update apple touch icon for Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/335689 (https://phabricator.wikimedia.org/T152538) (owner: Jdlrobson)
[00:28:15] (CR) jenkins-bot: Update apple touch icon for Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/335689 (https://phabricator.wikimedia.org/T152538) (owner: Jdlrobson)
[00:28:25] jdlrobson: icon on mwdebug1002
[00:28:39] (Merged) jenkins-bot: RelatedArticles enabled on French Wikipedia (mobile only) [mediawiki-config] - https://gerrit.wikimedia.org/r/335685 (https://phabricator.wikimedia.org/T156362) (owner: Jdlrobson)
[00:29:21] (PS7) Dzahn: librenms/locales/logstash/lshell linting changes [puppet] - https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[00:30:12] (CR) jenkins-bot: RelatedArticles enabled on French Wikipedia (mobile only) [mediawiki-config] - https://gerrit.wikimedia.org/r/335685 (https://phabricator.wikimedia.org/T156362) (owner: Jdlrobson)
[00:31:44] Dereckson: no way for me to completely check the icon change (I need to test on my phone where I cannot run mwdebug1002) but it's lookng good
[00:31:50] will verify again when it hits everywhere
[00:32:01] (PS3) Dereckson: Related pages is shown to 90% of mobile users [mediawiki-config] - https://gerrit.wikimedia.org/r/335686 (https://phabricator.wikimedia.org/T154681) (owner: Jdlrobson)
[00:32:28] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/335686 (https://phabricator.wikimedia.org/T154681) (owner: Jdlrobson)
[00:33:02] jdlrobson: okay syncing
[00:33:37] !log dereckson@tin Synchronized static/apple-touch/wikipedia.png: Update apple touch icon for Wikipedia (T152538) (duration: 00m 39s)
[00:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:42] T152538: Update apple touch icon - https://phabricator.wikimedia.org/T152538
[00:34:09] (Merged) jenkins-bot: Related pages is shown to 90% of mobile users [mediawiki-config] - https://gerrit.wikimedia.org/r/335686 (https://phabricator.wikimedia.org/T154681) (owner: Jdlrobson)
[00:34:17] (CR) jenkins-bot: Related pages is shown to 90% of mobile users [mediawiki-config] - https://gerrit.wikimedia.org/r/335686 (https://phabricator.wikimedia.org/T154681) (owner: Jdlrobson)
[00:34:20] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/5326/" [puppet] - https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[00:34:45] jdlrobson: RelatedArticles changes on mwdebug1002
[00:34:54] sweetttttt on it
[00:36:14] jdlrobson: any preference for the sync order?
[00:36:22] oh well we can do a sync-dir wmf-config/
[00:37:04] (Draft1) Paladox: phabricator: Set the elasticsearch version in a string [puppet] - https://gerrit.wikimedia.org/r/335735
[00:37:07] (PS2) Paladox: phabricator: Set the elasticsearch version in a string [puppet] - https://gerrit.wikimedia.org/r/335735
[00:37:53] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2995353 (Dzahn)
[00:37:56] Operations, netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2995351 (Dzahn) Resolved>Open reopening - I have one more request please. Can we please change install1001 to install1002 and install2001 to install2002? These are both V...
[00:38:16] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[00:38:21] (PS2) Dereckson: Set site name for ku.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/335478 (https://phabricator.wikimedia.org/T29878)
[00:38:31] Related pages look good Dereckson . Sync to French first please.
[00:39:09] Operations, netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2995356 (Dzahn) I realize this might be on your last day before you are away for a while, please feel free to put up for grabs and i'll ask others.
[00:39:25] Operations, netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#2995358 (Dzahn)
[00:40:45] (PS3) Paladox: phabricator: Set the elasticsearch version in a string [puppet] - https://gerrit.wikimedia.org/r/335735 (https://phabricator.wikimedia.org/T138881)
[00:40:49] okay
[00:41:33] (CR) Dzahn: [C: 2] "16:41 the version needs to be in a string otherwise it fails with the version checks we do in phab's core/" [puppet] - https://gerrit.wikimedia.org/r/335735 (https://phabricator.wikimedia.org/T138881) (owner: Paladox)
[00:41:45] mutante ^^ thanks :)
[00:42:30] !log dereckson@tin Synchronized dblists/related-articles-footer-blacklisted-skins.dblist: Enable RelatedArticles on Mobile French Wikipedia (T156362) (duration: 00m 44s)
[00:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:34] T156362: Deploy related pages to mobile french wikipedia stable channel - https://phabricator.wikimedia.org/T156362
[00:43:00] jdlrobson: synced, I wait your go for en.
[00:43:58] Yup en good to go too!
[00:44:26] Operations, ChangeProp, Services (later): Add storage to Change-Prop for deduplication - https://phabricator.wikimedia.org/T157089#2995365 (Pchelolo) As the work goes I want to tackle this first. Adding Redis to change-prop would benefit us no matter what and it's a basement for further improving the...
[00:44:49] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/335478 (https://phabricator.wikimedia.org/T29878) (owner: Dereckson)
[00:45:07] !log dereckson@tin Synchronized wmf-config/: Adjust RelatedArticles deployment scale for Mobile English Wikipedia (T154681) (duration: 00m 42s)
[00:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:13] T154681: Deploy related pages to mobile web stable for 90% English - https://phabricator.wikimedia.org/T154681
[00:45:13] jdlrobson: all synced
[00:45:16] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:45:57] (Merged) jenkins-bot: Set site name for ku.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/335478 (https://phabricator.wikimedia.org/T29878) (owner: Dereckson)
[00:46:43] (CR) jenkins-bot: Set site name for ku.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/335478 (https://phabricator.wikimedia.org/T29878) (owner: Dereckson)
[00:47:44] {{SITENAME}} gives me Wîkîferheng on mwdebug1002, so works, syncing
[00:48:26] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set site name for ku.wiktionary (T29878) (duration: 00m 39s)
[00:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:34] T29878: Change project namespace on kuwiktionary - https://phabricator.wikimedia.org/T29878
[00:48:40] Dereckson: is englishsynced? Not seeing it..
[00:48:50] jdlrobson: yes, 00:45:07 < logmsgbot> !log dereckson@tin Synchronized wmf-config/: Adjust RelatedArticles deployment scale for Mobile English Wikipedia (T154681) (duration:
[00:48:54] 00m 42s)
[00:49:37] mwrepl enwiki gives me:
[00:49:37] print_r($wgRelatedArticlesEnabledSamplingRate);
[00:49:38] 0.9
[00:49:40] so yes, deployed
[00:49:49] mm something weird is happening
[00:49:49] oh non
[00:49:55] dblist isn't in wmf-config/ folder
[00:49:58] so not sync'ed
[00:50:46] jdlrobson: syncing dblist file too
[00:51:05] so it will only contain dewiki and ruwiki in prod
[00:51:17] !log dereckson@tin Synchronized dblists/related-articles-footer-blacklisted-skins.dblist: Adjust RelatedArticles deployment scale for Mobile English Wikipedia (T154681) (duration: 00m 39s)
[00:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:21] T154681: Deploy related pages to mobile web stable for 90% English - https://phabricator.wikimedia.org/T154681
[00:52:03] jdlrobson: all works now?
[00:53:13] Dereckson: verifying
[00:53:52] Dereckson: works
[00:53:54] thanks!
[00:54:09] you're welcome
[00:54:33] w00t
[00:57:16] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[01:51:47] Operations: New Usernames/Passwords for arbcom archive access - https://phabricator.wikimedia.org/T157097#2995465 (Jalexander)
[01:52:21] Operations: New Usernames/Passwords for arbcom archive access - https://phabricator.wikimedia.org/T157097#2995477 (Dzahn) a:Dzahn
[01:52:33] Operations: New Usernames/Passwords for arbcom archive access - https://phabricator.wikimedia.org/T157097#2995478 (Jalexander) a:Dzahn>None
[02:03:33] Operations: New Usernames/Passwords for arbcom archive access - https://phabricator.wikimedia.org/T157097#2995508 (Dzahn) "Casliber" does not exist as a current user. The others listed as needing a reset do. I'll create that as a new user then. ?
[02:13:07] Operations: New Usernames/Passwords for arbcom archive access - https://phabricator.wikimedia.org/T157097#2995523 (Dzahn) @Jalexander i created users / updated password for all of them. Here are the new passwords. {F5460297}
[02:16:26] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:18:51] Operations, Internet-Archive, Offline-Working-Group: Create backups of Wikimedia content in diverse geographic places - https://phabricator.wikimedia.org/T156544#2979128 (tstarling) There are copies of the XML dumps on archive.org, but that's not really enough for disaster recovery. If we want to kee...
[02:32:13] Operations: New Usernames/Passwords for arbcom archive access - https://phabricator.wikimedia.org/T157097#2995534 (Jalexander) >>! In T157097#2995508, @Dzahn wrote: > "Casliber" does not exist as a current user. The others listed as needing a reset do. I'll create that as a new user then. ? For the record,...
[02:33:07] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 12m 06s)
[02:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Feb 3 02:38:10 UTC 2017 (duration 5m 3s)
[02:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:38] Operations: New Usernames/Passwords for arbcom archive access - https://phabricator.wikimedia.org/T157097#2995553 (Dzahn) Open>Resolved a:Dzahn I used the wrong command to set passwords first, but fixed that. _now_ they should all work. I confirmed 2 random ones as working.
[02:44:16] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:44:26] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[02:52:26] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.805 second response time
[02:54:26] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.563 second response time
[02:55:51] ummm something is up with graphite
[03:13:16] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[03:13:20] (PS1) Dzahn: switch webproxies away from carbon to install1002/2002 [dns] - https://gerrit.wikimedia.org/r/335747
[03:18:31] (CR) Dzahn: "they work fine for me when manually setting them. (f.e. http_proxy="http://install1002.wikimedia.org:8080")" [dns] - https://gerrit.wikimedia.org/r/335747 (owner: Dzahn)
[03:19:28] (PS2) Dzahn: switch webproxies away from carbon to install1002/2002 [dns] - https://gerrit.wikimedia.org/r/335747 (https://phabricator.wikimedia.org/T123733)
[03:36:05] (CR) Tim Landscheidt: [C: 1] icinga: remove pre-jessie conditional from monitoring::group [puppet] - https://gerrit.wikimedia.org/r/318442 (https://phabricator.wikimedia.org/T125023) (owner: Dzahn)
[04:17:16] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=383.80 Read Requests/Sec=524.50 Write Requests/Sec=1.70 KBytes Read/Sec=30840.00 KBytes_Written/Sec=39.20
[04:18:16] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:26:16] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=19.90 Read Requests/Sec=1.70 Write Requests/Sec=21.80 KBytes Read/Sec=6.80 KBytes_Written/Sec=1247.60
[04:48:16] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[05:40:26] PROBLEM - puppet last run on labstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:50] (PS2) Kaldari: Setting $wgPageAssessmentsSubprojects to true on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/335732
[06:08:26] RECOVERY - puppet last run on labstore1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:21:46] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Peering: Equinix Dallas (SR 17915024) {#11397} [10Gbps DF]BR
[06:27:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 24 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:32:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 12 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:38:46] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0
[06:39:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:44:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 13 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:51:46] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:15:19] !log Stop MySQL db1095 to snapshot it to es1013:/srv/tmp - T153743
[07:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:26] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[07:15:46] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[07:16:26] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:20:06] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2385
[07:25:06] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 311752 Threads: 1 Questions: 3807912 Slow queries: 1429 Opens: 3349 Flush tables: 1 Open tables: 560 Queries per second avg: 12.214 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[07:39:26] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[07:39:46] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[07:41:56] !log upgrading firejail on remaining wtp/Parsoid hosts
[07:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:00] graphite is still suffering a bit :(
[07:44:10] I guess that we didn't find the new SSD
[07:45:27] (CR) Marostegui: [C: 1] "Looks good: https://puppet-compiler.wmflabs.org/5327/" [puppet] - https://gerrit.wikimedia.org/r/334298 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[07:50:42] (PS1) Tim Landscheidt: Tools: Outfactor jobkill script to toollabs::node::all [puppet] - https://gerrit.wikimedia.org/r/335755
[07:51:26] (PS1) Elukey: Enable aqs1008-b (AQS cassandra cluster) [puppet] - https://gerrit.wikimedia.org/r/335756 (https://phabricator.wikimedia.org/T155654)
[07:54:33] marostegui: there is nothing better than bootstrapping a cassandra instance after the first coffee of the morning
[07:54:37] you should try
[07:54:38] :P
[07:54:48] elukey: is that better than an ALTER table?!!
[07:55:08] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/5328/aqs1008.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/335756 (https://phabricator.wikimedia.org/T155654) (owner: Elukey)
[07:55:44] marostegui: I don't know, I don't usually do it in the morning! This is why I am telling you.. so you can compare! :D
[07:58:39] elukey: good morning, on the 30th you logged " elukey: set mw1236.eqiad.wmnet pooled=inactive because powered off (no mentions on the SAL, still trying to find why)" to SAL, did you find something? i just noticed it's still powered down
[08:01:37] moritzm: morning! I opened a ticket to dc ops but no feedback since then.. IIRC I wasn't able to use the console
[08:05:37] ah,which one? I was searching phab, but couldn't find one
[08:05:53] !log bootstrapping aqs1008-b (AQS Cassandra instance)
[08:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:27] moritzm: https://phabricator.wikimedia.org/T156610 - ah now I remember! I wasn't able to powerup
[08:07:12] weird, why doesn't https://phabricator.wikimedia.org/search/query/19iMZVlyIxxX/#R show this ticket?
[08:08:26] weird indeed
[08:15:05] !log restarting prometheus servers to pick up openssl update
[08:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:49] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:39] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/7: down - Transit: CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu]BR
[08:42:00] anybody working on it?
[08:42:12] I mean, is it expected or just went down by itself?
[08:43:20] yeah there was a maint notice from cyrus one, it is the management router though
[08:43:23] I am a bit confused about the port's error msg
[08:45:29] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:50:39] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0
[08:50:54] !log installing tomcat regression updates on trusty hosts (jessie update was fine)
[08:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:41] (PS1) Filippo Giunchedi: cache: move graphite/performance backends to graphite2001 [puppet] - https://gerrit.wikimedia.org/r/335761 (https://phabricator.wikimedia.org/T157022)
[08:52:43] (PS1) Filippo Giunchedi: graphite: move performance::site to graphite2001 [puppet] - https://gerrit.wikimedia.org/r/335762 (https://phabricator.wikimedia.org/T157022)
[08:52:45] (PS1) Filippo Giunchedi: graphite: move alerts to graphite2001 [puppet] - https://gerrit.wikimedia.org/r/335763 (https://phabricator.wikimedia.org/T157022)
[08:52:47] (PS1) Filippo Giunchedi: diamond: switch to graphite2001 [puppet] - https://gerrit.wikimedia.org/r/335764 (https://phabricator.wikimedia.org/T157022)
[08:52:49] (PS1) Filippo Giunchedi: graphite: switch graphite alerts to graphite2001 [puppet] - https://gerrit.wikimedia.org/r/335765 (https://phabricator.wikimedia.org/T157022)
[08:54:39] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms
[08:54:58] (PS1) Filippo Giunchedi: graphite: switch to graphite2001 [dns] - https://gerrit.wikimedia.org/r/335766 (https://phabricator.wikimedia.org/T157022)
[08:56:19] PROBLEM - DPKG on graphite1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:00:24] Operations, ops-eqiad, Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2995779 (fgiunchedi) I've staged the patches needed for failover in a series of reviews above. There's also a graphite-codfw dashboard at https://grafana.wikimedia.org/dashboard/d...
[09:01:01] (PS4) Elukey: Replace codfw Memcached/Redis mc2008->mc2011 with mc2026->mc2029 [puppet] - https://gerrit.wikimedia.org/r/335676 (https://phabricator.wikimedia.org/T155755)
[09:05:34] moritzm: I am finally around
[09:07:20] good morning, do you want to rerun the test suite on deployment-tin? otherwise I'll upload the new build now
[09:08:51] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/5329/" [puppet] - https://gerrit.wikimedia.org/r/335676 (https://phabricator.wikimedia.org/T155755) (owner: Elukey)
[09:10:43] !log Replace Redis/Memcached shards mc2008->2011 with mc2026->mc2029
[09:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:16] moritzm: I did run the suite on deployment-tin it is all fine
[09:11:34] there are failures but that is unrelated imho . HHVM no more segfault
[09:11:54] so I guess we can push 3.12.11+dfsg-1+wmf2 to apt.wikimedia.Org
[09:12:07] I will upgrade the CI Jessie instances and the beta cluster mw app servers
[09:12:24] then on monday we can look at the logs and see what happened :}
[09:14:13] ok, uploading the new build now
[09:15:29] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[09:18:40] new codfw shards looks good, just checked redis replication
[09:18:59] 7 left to go
[09:22:00] !log uploaded hhvm_3.12.11+dfsg-1+wmf2 to apt.wikimedia.org
[09:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:33] hashar: ^, I'll upgrade the canary app servers as well, they had been running the new HHVM package before I reverted them yesterday to 3.12.7 to investigate the regression
[09:23:58] the production canaries ?
[09:26:04] bah deployment-prep instances no more run unattended / automatic upgrade
[09:26:14] yeah, the production canaries
[09:29:49] (PS2) Marostegui: site.pp: Change db1064 to ROW [puppet] - https://gerrit.wikimedia.org/r/335407 (https://phabricator.wikimedia.org/T153743)
[09:35:07] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team, HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2995857 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff A new HHVM package has been uploaded...
[09:36:39] (03PS1) 10Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335774 (https://phabricator.wikimedia.org/T153743) [09:38:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335774 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:39:52] !log upgraded mwdebug* and mw1261 to the new HHVM package [09:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335774 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:40:24] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335774 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:41:16] (03CR) 10Marostegui: [C: 032] site.pp: Change db1064 to ROW [puppet] - 10https://gerrit.wikimedia.org/r/335407 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:41:45] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db1064 - T153743 (duration: 00m 40s) [09:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:49] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:42:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 - T153743 (duration: 00m 40s) [09:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:06] !log Restart mysql on db1064 to get its binary log changed to ROW - T153743 [09:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:10] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:50:41] 06Operations, 10Ops-Access-Requests, 10Analytics, 
10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2991994 (10MoritzMuehlenhoff) @MusikAnimal : Please generate a new SSH key for production access (see https://wikitech.wikimedi... [09:51:34] (03PS1) 10Gehel: relforge - switch master to relforge1002 [puppet] - 10https://gerrit.wikimedia.org/r/335776 (https://phabricator.wikimedia.org/T151326) [09:53:08] moritzm: another question, does Trusty still get HHVM upgrades? [09:53:13] or are we solely on Jessie nowadays? [09:54:08] !log upgrade & restart of db2063 T111654 [09:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:12] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [09:54:19] (03CR) 10DCausse: [C: 031] relforge - switch master to relforge1002 [puppet] - 10https://gerrit.wikimedia.org/r/335776 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [09:54:35] (03CR) 10Gehel: [C: 032] relforge - switch master to relforge1002 [puppet] - 10https://gerrit.wikimedia.org/r/335776 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [09:55:18] 06Operations, 10Wikimedia-Logstash, 07HHVM: Fatal exception of type "Scribunto_LuaInterpreterNotFoundError" - https://phabricator.wikimedia.org/T157110#2995907 (10Josve05a) [09:55:35] !log restarting relforge1002 to pick up new master configuration [09:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:08] 06Operations, 10Wikimedia-Logstash, 07HHVM: Fatal exception of type "Scribunto_LuaInterpreterNotFoundError" - https://phabricator.wikimedia.org/T157110#2995920 (10Josve05a) See also old ticket {T88942}. 
[09:56:20] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335777 [09:56:31] 06Operations, 10Wikimedia-Logstash, 07HHVM: Fatal exception of type "Scribunto_LuaInterpreterNotFoundError" - https://phabricator.wikimedia.org/T157110#2995907 (10Legoktm) ``` 2017-02-03 09:51:00 [WJRSgwpAADgAAXiYZDYAAAAR] mw1261 commonswiki 1.29.0-wmf.10 exception ERROR: [WJRSgwpAADgAAXiYZDYAAAAR] /wiki/Fil... [09:58:04] moritzm: luasandbox might be missing on mw1261? related to hhvm upgrade? ^^ [09:58:05] 06Operations, 10Wikimedia-Logstash, 07HHVM: Fatal exception of type "Scribunto_LuaInterpreterNotFoundError" - https://phabricator.wikimedia.org/T157110#2995926 (10Josve05a) >>! In T157110#2995922, @Legoktm wrote: > How is LuaSandbox not installed? @Anomie has the same question here: >>! In T72051#718149, @A... [09:59:18] !log Upgrade db1064 from MariaDB 10.0.23 to 10.0.29 - T153743 [09:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:22] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:59:46] legoktm: no, it's installed there; 2.0.12~jessie1 [10:00:47] I've downgraded mw1261 to 3.12.7 now [10:01:51] was that limited to mw1261? 
hhvm 3.12.11 was running on mw1261-mw1264 for several days before [10:02:37] legoktm@fluorine:/a/mw-log$ grep "Scribunto_LuaInterpreterNotFoundError from line 257" exception.log | grep -v mw1261 -c [10:02:37] 0 [10:02:39] so just mw1261 [10:02:43] but it stopped [10:03:08] last one was at 09:58:41 [10:03:28] that's strange, will try to reproduce in a VM with the hhvm package [10:03:34] it started at 09:38:21 [10:04:31] !log Reboot db1064 to pick up the new kernels T153743 [10:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:35] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [10:04:57] legoktm: that matches the time the new hhvm build was installed [10:05:33] it's still weird, since that build is identical to what we ran before except the patch that broke the mw test suite (and the testsuite ran successful with it as well) [10:05:44] but none of the other hosts had log errors... [10:05:48] #anyone, will try to reproduce in a VM later [10:05:53] ok :) [10:05:54] I only upgraded mw1261 initially [10:05:58] * legoktm really goes to sleep now [10:08:31] 06Operations, 10Wikimedia-Logstash, 07HHVM: Fatal exception of type "Scribunto_LuaInterpreterNotFoundError" - https://phabricator.wikimedia.org/T157110#2995907 (10MoritzMuehlenhoff) I reverted the new HHVM build from mw1261. [10:09:26] moritzm: don't you need to rebuild the hhvm extensions as well ? [10:09:45] (03PS1) 10Elukey: Disable auto-restart for nutcracker when config.yaml changes [puppet] - 10https://gerrit.wikimedia.org/r/335780 (https://phabricator.wikimedia.org/T155755) [10:10:42] moritzm: (whenever you have time) - I'd like to discuss with you --^ vs base::service_unit [10:10:54] but nothing urgent [10:11:03] hashar: only if the module ABI changes, not usually not. 
needs further digging [10:11:28] 06Operations, 10Wikimedia-Logstash, 07HHVM: Fatal exception of type "Scribunto_LuaInterpreterNotFoundError" - https://phabricator.wikimedia.org/T157110#2995988 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03MoritzMuehlenhoff [10:11:44] moritzm: I got HHVM upgraded on the beta cluster. So might help to reproduce [10:12:47] hashar: well, deployment-tin had the new HHVM package as well, so I'm wondering why that didn't occur there [10:12:47] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [10:14:47] moritzm: cause I only ran the mediawiki/core test suite [10:14:52] not the scribunto one :} [10:17:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335777 (owner: 10Marostegui) [10:17:15] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#2995994 (10MoritzMuehlenhoff) The tests should also cover the extensions (e.g. Scribunto)... 
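Note on the 10:02 check: the `grep ... | grep -v mw1261 -c` one-liner run on fluorine only tells you whether any host other than mw1261 logged the error. A per-host tally gives the same answer and also shows how many hits each host had. The sketch below is a minimal illustration, not the tooling used in the channel. It assumes each exception.log line looks like the excerpt pasted in T157110 (date, time, request id in brackets, then hostname), so the hostname is the fourth whitespace-separated field; `count_by_host` is a hypothetical helper name.

```python
from collections import Counter

# Error class name taken from the log discussion above.
NEEDLE = "Scribunto_LuaInterpreterNotFoundError"


def count_by_host(lines, needle=NEEDLE):
    """Count lines containing `needle`, grouped by hostname.

    Assumption: the hostname is the 4th whitespace-separated field,
    as in "2017-02-03 09:51:00 [reqid] mw1261 commonswiki ...".
    """
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) >= 4:
            counts[fields[3]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_by_host.py < exception.log
    for host, n in count_by_host(sys.stdin).most_common():
        print(host, n)
```

If only one host shows up in the output, that matches the "so just mw1261" conclusion reached in the channel.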
[10:18:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335777 (owner: 10Marostegui) [10:18:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335777 (owner: 10Marostegui) [10:19:49] elukey: that looks good to me, but better doublecheck with Giuseppe on Monday, I might possibly miss some cornercase for which it's needed [10:19:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1064 - T153743 (duration: 00m 42s) [10:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:58] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [10:25:49] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2996018 (10hashar) 05Resolved>03Open Reopening since it might still fails on some CI jobs and/or on the beta... [10:27:45] (03CR) 10Ema: [C: 031] "> > @godog: I'm not convinced this is the right solution, if the RAM" [puppet] - 10https://gerrit.wikimedia.org/r/334364 (https://phabricator.wikimedia.org/T155876) (owner: 10Filippo Giunchedi) [10:29:11] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2996021 (10MoritzMuehlenhoff) p:05Unbreak!>03High [10:31:31] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#2993373 (10hashar) We can get a copy of the mediawiki-extensions-* job. That clones mediaw... 
[10:33:24] moritzm: scribunto CI job still runs HHVM tests on Trusty instances [10:33:49] I did a recheck of some Scribunto change https://gerrit.wikimedia.org/r/#/c/43176/14 I guess it will pass just fine ( https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm/36297/console ) [10:34:27] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:27] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#2996046 (10MoritzMuehlenhoff) The mechanism should not be specific to HHVM, but apply to t... [10:35:09] 06Operations, 10ops-codfw, 06Community-Liaisons, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2996051 (10elukey) Today I discovered that Swapping mc hosts in codfw is not a completely safe operation since all the mw hosts in eqiad have mult... [10:35:47] moritzm: ah yes I wanted to know if base::service_unit was not there for a reason or just because not available at the time [10:37:18] elukey: I suppose this predates base::service_unit [10:38:04] super [10:38:08] thanks :) [10:39:39] 06Operations, 10ops-codfw, 06Community-Liaisons, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953444 (10Jseddon) I'm not a CL but let me know if its worth setting up a CentralNotice maintenance banner to logged in users just for those time... [10:40:03] elukey: indeed, the nutcracker service was added in 2014 and base::service_unit in 2015 [10:41:29] 06Operations, 10ops-codfw, 06Community-Liaisons, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2996056 (10elukey) @Jseddon Hello! 
The next round of host swaps will not create impact, so for the moment the CentralNotice banner is not needed,... [10:50:52] !log mwdebug* and mw1261 have been reverted to previous HHVM package [10:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:42] claro [10:54:49] !log preparing to reimage db2053 T111654 [10:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:54] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [10:56:15] (03PS1) 10KartikMistry: Deploy Compact Language Links out of beta in French/Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335781 (https://phabricator.wikimedia.org/T157108) [10:56:53] moritzm: running Scribunto test on Nodepool Jessie pass just fine ( https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/181/console ) [10:56:56] reporting on task [10:57:35] (03PS1) 10Muehlenhoff: Add remaining email addresses [puppet] - 10https://gerrit.wikimedia.org/r/335782 [10:58:16] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2996078 (10hashar) 05Open>03Resolved There were mentions of Scribunto errors. I did a `check experimental`... [10:58:35] hmm, this might be a problem with the registration of the HHVM extension, will poke at it later on [11:00:42] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2996082 (10hashar) The Scribunto error is on production canaries and got reported at T157110 [11:01:18] moritzm: maybe we could upgrade mwdebug1002 for that ? 
[11:01:25] would let us reproduce / debug it [11:01:56] !log Alter table metawiki.pagelinks on db1039 (depooled) - T153300 [11:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:01] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [11:02:15] hashar: sure, we can do that [11:02:31] (03PS2) 10Muehlenhoff: Add remaining email addresses [puppet] - 10https://gerrit.wikimedia.org/r/335782 [11:02:43] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey, 07User-notice: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2996087 (10Elitre) [11:03:27] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:03:48] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey, 07User-notice: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953444 (10Elitre) If I understand this correctly, you want people to be notified about past outages. That's a line for Tech News, that included similar... [11:05:48] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#2996102 (10hashar) Ok that makes sense. For Nodepool we would need a new image, that woul... [11:07:09] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey, 07User-notice: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2996105 (10elukey) >>! In T155755#2996087, @Elitre wrote: > If I understand this correctly, you want people to be notified about past outages. That's a l... 
[11:08:45] (03CR) 10Muehlenhoff: [C: 032] Add remaining email addresses [puppet] - 10https://gerrit.wikimedia.org/r/335782 (owner: 10Muehlenhoff) [11:12:39] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#2996116 (10MoritzMuehlenhoff) [11:12:42] 06Operations, 13Patch-For-Review: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2996114 (10MoritzMuehlenhoff) 05Open>03Resolved All user accounts in data.yaml are now annotated with an email address. The consistency check script warns if some should be missing in the future. [11:14:27] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:21:13] !log preparing to reimage db2054 T111654 [11:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:18] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [11:38:04] (03PS1) 10Muehlenhoff: Remove madhuvishy from statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/335787 (https://phabricator.wikimedia.org/T142836) [11:43:27] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [11:49:47] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [12:16:40] !log restarting relforge1001 to pick up new master configuration [12:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:27] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 95 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 91, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 205, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, 
relocating_shards: 0, active_shards_percent_as_number: 68.3 [12:19:37] ACKNOWLEDGEMENT - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 95 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 91, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 205, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_numb [12:20:19] ^restart for new master election is taking longer than expected, situation back to normal on relforge in a few seconds / minutes [12:20:27] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:27] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 259, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.0, active_shards: 297, initial [12:26:54] (03PS1) 10Jcrespo: Reimage by default all db2XXX hosts > db2034 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/335794 [12:28:38] (03CR) 10Jcrespo: [C: 032] Reimage by default all db2XXX hosts > db2034 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/335794 (owner: 10Jcrespo) [12:31:27] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:32:08] (03PS1) 10Elukey: Revert "Revert "Add JVM Heap usage alarms for basic Hadoop daemons"" [puppet] - 10https://gerrit.wikimedia.org/r/335795 [12:32:22] (03PS2) 10Elukey: Revert "Revert "Add JVM Heap usage alarms for basic Hadoop daemons"" [puppet] - 10https://gerrit.wikimedia.org/r/335795 [12:33:10] (03PS1) 10Mforns: Add 
config for banner activity pivot data set [puppet] - 10https://gerrit.wikimedia.org/r/335796 (https://phabricator.wikimedia.org/T155141) [12:38:22] moritzm, should I merge your patch? [12:40:23] seems harmless, so I will go ahead, moritzm [12:41:18] ah, sorry, yes. please go ahead [12:43:08] puppet is disabled on carbon for 'reason not specified' [12:43:42] no message on icinga either [12:44:09] jynus: 20:35 mutante: carbon - disabling puppet (to stop it from re-adding second IPv6 address causing issues with ferm rules) [12:44:20] https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2017-02-02 [12:46:54] (03PS2) 10Mforns: Add config for banner activity pivot data set [puppet] - 10https://gerrit.wikimedia.org/r/335796 (https://phabricator.wikimedia.org/T155141) [12:47:17] PROBLEM - Check systemd state on install1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:47:38] I do not see the error on https://gerrit.wikimedia.org/r/#/c/335794/1/modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 [12:47:51] oh, I see it now [12:49:32] (03PS1) 10Jcrespo: install_server: Followup to gerrit:335794 [puppet] - 10https://gerrit.wikimedia.org/r/335798 [12:49:34] (03PS3) 10Elukey: Add config for banner activity pivot data set [puppet] - 10https://gerrit.wikimedia.org/r/335796 (https://phabricator.wikimedia.org/T155141) (owner: 10Mforns) [12:49:37] PROBLEM - Check systemd state on install2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:50:56] (03CR) 10Jcrespo: [C: 032] install_server: Followup to gerrit:335794 [puppet] - 10https://gerrit.wikimedia.org/r/335798 (owner: 10Jcrespo) [12:51:20] (03CR) 10Elukey: [C: 032] Add config for banner activity pivot data set [puppet] - 10https://gerrit.wikimedia.org/r/335796 (https://phabricator.wikimedia.org/T155141) (owner: 10Mforns) [12:51:27] (03PS4) 10Elukey: Add config for banner activity pivot data set [puppet] - 10https://gerrit.wikimedia.org/r/335796 (https://phabricator.wikimedia.org/T155141) (owner: 10Mforns) [12:52:05] (03CR) 10Elukey: [V: 032 C: 032] Add config for banner activity pivot data set [puppet] - 10https://gerrit.wikimedia.org/r/335796 (https://phabricator.wikimedia.org/T155141) (owner: 10Mforns) [12:53:13] fixed [12:53:17] RECOVERY - Check systemd state on install1001 is OK: OK - running: The system is fully operational [12:53:37] RECOVERY - Check systemd state on install2001 is OK: OK - running: The system is fully operational [12:57:51] (03PS3) 10Nschaaf: Drop wdqs_extract partitions older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) [13:00:37] (03CR) 10Nschaaf: [C: 031] "I tested the script with the dry run option and the parameters as specified in the puppet manifest, and the output was consistent with dro" [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [13:13:09] 06Operations, 10Traffic, 07Wikimedia-Incident: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#2924351 (10BBlack) Recording this while I remember it: # The VSLP director code panics if there are no backends defined for a dir... 
[13:17:09] 06Operations, 10Analytics, 10Traffic: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2996314 (10BBlack) [13:24:28] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2996318 (10Paladox) For testing phabricator on Jessie, I've setup phabricator-01.wmflabs.org. All looks ok at the moment. [13:35:29] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:04] linting is fun : ✖ 40242 problems (40242 errors, 0 warnings) [13:58:04] !log restarting and upgrading db2041 T111654 [13:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:09] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [13:59:48] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [14:04:28] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:07:58] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:17:24] (03PS1) 10Jcrespo: mariadb: Depool db2061 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335804 [14:21:31] (03CR) 10Marostegui: [C: 031] mariadb: Depool db2061 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335804 (owner: 10Jcrespo) [14:21:53] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2061 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335804 (owner: 10Jcrespo) [14:23:41] (03Merged) 10jenkins-bot: mariadb: Depool db2061 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335804 (owner: 10Jcrespo) [14:23:53] (03CR) 10jenkins-bot: mariadb: Depool db2061 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335804 (owner: 10Jcrespo) [14:26:22] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2061 (duration: 00m 40s) [14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:40] (03PS1) 10Ema: Add PyOpenSSL to requirements.txt, explain how to run tests [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335807 [14:30:58] !log upgrade and restart db2061 T111654 [14:30:58] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:03] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [14:33:02] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2061 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335808 [14:33:40] (03CR) 10Ema: [V: 032 C: 032] Add PyOpenSSL to requirements.txt, explain how to run tests [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335807 (owner: 10Ema) [14:33:48] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.207 second response time [14:34:35] (03Abandoned) 10Ema: Add PyOpenSSL to 
requirements.txt, explain how to run tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/334193 (owner: 10Ema) [14:35:22] !log restart apache on graphite1001 to see if it helps sqlite lock isssue [14:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:58] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:36:04] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db2061 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335808 (owner: 10Jcrespo) [14:39:38] (03PS3) 10Hashar: Introduce linters using rake [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/331330 (https://phabricator.wikimedia.org/T154894) [14:41:37] (03CR) 10Hashar: "check experimental" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/331330 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:44:43] (03CR) 10Elukey: [C: 032] "bundle exec rake test" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/331330 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:46:55] (03CR) 10Elukey: [C: 032] Ignore flake8 error about duplicate keys in dict [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/331333 (owner: 10Hashar) [14:47:03] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#2996402 (10Gehel) Note to @Cmjohnson: before shutting down relforge1001 for maintenance, shards should be drained from it with `es-tool ban-node 10.64.4.13`. This can take a few hours. Pin... 
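Note on `es-tool ban-node 10.64.4.13`: draining shards off a node before maintenance is commonly done through Elasticsearch's cluster settings API, by adding the node's IP to `cluster.routing.allocation.exclude._ip`. Whether es-tool does exactly this internally is an assumption here; the sketch only builds the JSON body one would send with `PUT /_cluster/settings`. `ban_node_settings` is a hypothetical helper name, and no request is actually sent.

```python
import json


def ban_node_settings(ip):
    """Build a transient cluster-settings body that excludes `ip` from
    shard allocation, so Elasticsearch moves shards off that node.

    Assumption: this mirrors what a "ban node" helper would send;
    it is not taken from the es-tool source.
    """
    return {
        "transient": {
            "cluster.routing.allocation.exclude._ip": ip,
        }
    }


if __name__ == "__main__":
    # IP taken from the Phabricator comment above (relforge1001).
    print(json.dumps(ban_node_settings("10.64.4.13"), indent=2))
```

A transient setting is used in the sketch because the exclusion is only meant to last for the maintenance window. As the comment above says, the drain itself can take a few hours.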
[14:48:09] (03PS1) 10Hashar: nodepool: disambiguate images/snapshot/labels [puppet] - 10https://gerrit.wikimedia.org/r/335809 [14:53:33] (03PS7) 10Hashar: Introduce linters using rake [puppet/cdh] - 10https://gerrit.wikimedia.org/r/331312 (https://phabricator.wikimedia.org/T154894) [14:53:51] (03CR) 10Hashar: "check experimental" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/331312 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:54:06] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2061 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335808 (owner: 10Jcrespo) [14:56:26] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2061 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335808 (owner: 10Jcrespo) [14:56:40] (03CR) 10Elukey: [C: 032] Introduce linters using rake [puppet/cdh] - 10https://gerrit.wikimedia.org/r/331312 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:56:46] (03PS3) 10Hashar: Introduce linters using rake [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/331327 (https://phabricator.wikimedia.org/T154894) [14:56:52] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2061 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335808 (owner: 10Jcrespo) [14:57:17] (03CR) 10Hashar: "check experimental" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/331327 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:00:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:01:28] (03PS4) 10Hashar: Introduce linters using rake [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/331328 (https://phabricator.wikimedia.org/T154894) [15:01:47] !log preparing to reimage db2039 T111654 [15:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:52] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [15:01:53] (03CR) 10Hashar: "check experimental" 
[puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/331328 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:02:27] (03PS1) 10BBlack: remove outdated commentary [puppet] - 10https://gerrit.wikimedia.org/r/335812 [15:03:05] (03CR) 10Elukey: [C: 032] Introduce linters using rake [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/331328 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:03:33] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2061 (duration: 00m 40s) [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:42] (03PS2) 10BBlack: replace outdated commentary on TLS redirects [puppet] - 10https://gerrit.wikimedia.org/r/335812 [15:04:50] (03CR) 10BBlack: [C: 032] replace outdated commentary on TLS redirects [puppet] - 10https://gerrit.wikimedia.org/r/335812 (owner: 10BBlack) [15:04:56] 06Operations: Harmonise "Directory Managers" group - https://phabricator.wikimedia.org/T157131#2996506 (10MoritzMuehlenhoff) [15:05:03] grrrr submodules! [15:05:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:07:12] modules/cdh is bad now, pushing fix [15:08:22] (03PS1) 10BBlack: Revert bad CDH submodule update [puppet] - 10https://gerrit.wikimedia.org/r/335813 [15:08:47] bblack: sorry it was me, but what happened? 
it was a rakefile update [15:08:59] ahhh the sha [15:09:00] no, it was my error [15:09:09] yes super annoying, sorry for that [15:09:10] (03PS3) 10Hashar: Introduce linters using rake [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/331332 (https://phabricator.wikimedia.org/T154894) [15:09:11] I didn't see the submodules update when I did "git pull -r" [15:09:21] * elukey blames ottomata :P [15:09:59] (03CR) 10BBlack: [C: 032] Revert bad CDH submodule update [puppet] - 10https://gerrit.wikimedia.org/r/335813 (owner: 10BBlack) [15:10:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:10:27] (03CR) 10Hashar: "check experimental" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/331332 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:10:39] 06Operations, 07Puppet, 06Operations-Software-Development: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133#2996538 (10jcrespo) [15:10:51] 06Operations, 07Puppet, 06Operations-Software-Development: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133#2996550 (10jcrespo) p:05Triage>03Low [15:11:07] I puppet-merged both together, so actual systems shouldn't have seen the issue [15:11:41] super, thanks.. I am going to review a couple more submodules for hashar and then I'll update their shas in operations/puppet [15:11:55] we are almost done :-)} [15:11:56] submodules are evil [15:12:28] ^ +1 to that [15:12:30] analytics fault, you know those bad people [15:12:36] :P [15:15:08] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 169 seconds ago with 0 failures [15:16:48] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:17:06] (03PS3) 10Hashar: Introduce linters using rake [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) [15:17:47] (03CR) 10Hashar: "After some review with Elukey:" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:17:56] (03CR) 10Hashar: "check experimental" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:18:12] * elukey blames marostegui for puppet/mariadb [15:18:15] :P [15:18:33] (03CR) 10Hashar: "That change is pending Jynus / DBA approval since it touches the DB and might well explode something on deployment." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:18:43] hashar, I can start with that [15:18:55] when I finish my current task [15:19:08] not sure if I want to do it on a friday evening [15:19:12] though [15:19:29] I know it is a noop [15:19:36] but it is on a critical path [15:21:10] (03CR) 10Elukey: [C: 032] Introduce linters using rake [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/331332 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:21:35] (03CR) 10Elukey: [V: 032 C: 032] Add .gitreview [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/331331 (owner: 10Hashar) [15:21:53] (03PS1) 10Marostegui: mariadb: Add gtid_domain_id to s4 [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) [15:27:13] (03PS1) 10Marostegui: mariadb: Use the common gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/335817 (https://phabricator.wikimedia.org/T149418) [15:28:01] (03CR) 10Jcrespo: [C: 031] mariadb: Use the common gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/335817 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:28:04] 
(03CR) 10Elukey: [C: 032] Introduce linters using rake [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/331327 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [15:29:21] (03CR) 10Jcrespo: "Actually, we do not use this code. From our point of view, this can be deployed at any time." [puppet] - 10https://gerrit.wikimedia.org/r/334298 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [15:30:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:31:05] grrr. heka puppetflap is probably puppet vs puppetmaster not chattering happily after package upgrades. fixing [15:32:41] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5331/ looks good and gtid_domain_id value doesn't change" [puppet] - 10https://gerrit.wikimedia.org/r/335817 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:35:08] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 204 seconds ago with 0 failures [15:35:25] (03PS1) 10Elukey: Update the varnishkafka submule's sha [puppet] - 10https://gerrit.wikimedia.org/r/335818 [15:42:38] (03CR) 10Hashar: [C: 04-1] "Most patches have landed. There is one pending for mariadb. 
Also need submodule to be bumped and then this change can be rebased/tested a" [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [15:42:40] (03CR) 10Elukey: [C: 032] "No op - https://puppet-compiler.wmflabs.org/5334/" [puppet] - 10https://gerrit.wikimedia.org/r/335818 (owner: 10Elukey) [15:46:48] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:47:06] (03PS2) 10Marostegui: mariadb: Add gtid_domain_id to s4 [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) [15:47:57] (03CR) 10Jcrespo: mariadb: Add gtid_domain_id to s4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:47:59] (03PS1) 10Elukey: Update the kafkatee submule's sha [puppet] - 10https://gerrit.wikimedia.org/r/335819 [15:51:26] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5336/ - no op" [puppet] - 10https://gerrit.wikimedia.org/r/335819 (owner: 10Elukey) [15:52:22] (03CR) 10Marostegui: "Yeah, let's go for s6. I chose s4 because we were going to import it on db1095, but that was the sole reason. Let's go for less critical w" [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:53:36] (03PS3) 10Marostegui: mariadb: Add gtid_domain_id to s6 [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) [15:54:41] (03PS1) 10Elukey: Update the zookeeper submule's sha [puppet] - 10https://gerrit.wikimedia.org/r/335820 [15:58:28] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2996697 (10demon) >>! In T152525#2994909, @RobH wrote: > Assigning this task to Chad. Once he is aware that this system is all theirs, he can resolve. Confirmed. >>! 
In T152525#... [15:59:59] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5337/ this compiles fine and only touches the hosts in s6." [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:08:10] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5339/ - no op" [puppet] - 10https://gerrit.wikimedia.org/r/335820 (owner: 10Elukey) [16:10:14] !log rsync coal data graphite1001 -> graphite2001 [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:22] godog: \o/ [16:10:53] elukey: \o/ [16:10:55] 06Operations, 10Ops-Access-Requests: Update cmjohnson's keys - https://phabricator.wikimedia.org/T157139#2996708 (10Cmjohnson) [16:11:04] I'd like at least a sanity check on https://gerrit.wikimedia.org/r/#/c/335761 [16:11:23] ema: ^ perhaps? [16:14:58] elukey: are you familiar with el zmq stream? for https://gerrit.wikimedia.org/r/#/c/335762/1/modules/role/manifests/eventlogging/analytics/zeromq.pp [16:15:14] I think it should be fine, there will be two consumers and that's it, graphite1001 and graphite2001 [16:15:53] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/335761 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [16:16:10] (03PS1) 10Elukey: Update cdh submodule's sha [puppet] - 10https://gerrit.wikimedia.org/r/335821 [16:16:14] godog: afaik that one is not used because it is kafka only [16:16:41] mmmm [16:17:03] ah it *sends* to zeromq [16:17:08] I wasn't aware of that [16:17:37] yeah, so basically adding another consumer from graphite2001 [16:17:40] ema: thanks! [16:18:03] godog: no idea, but I guess it is fine.. maybe we can ask to performance? [16:18:15] godog: I'm here to please!
[16:19:08] elukey: ok, I'm not sure anyone is online atm [16:19:10] brb [16:20:21] (03CR) 10Jcrespo: [C: 031] mariadb: Add gtid_domain_id to s6 [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:22:49] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5340/ - no op" [puppet] - 10https://gerrit.wikimedia.org/r/335821 (owner: 10Elukey) [16:25:26] I've pinged performance [16:25:44] (03CR) 10Filippo Giunchedi: [C: 032] graphite: move performance::site to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335762 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [16:26:49] (03PS1) 10Elukey: Update the jmxtrans submule's sha [puppet] - 10https://gerrit.wikimedia.org/r/335822 [16:27:25] (03PS2) 10Filippo Giunchedi: graphite: move performance::site to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335762 (https://phabricator.wikimedia.org/T157022) [16:27:27] (03PS2) 10Filippo Giunchedi: cache: move graphite/performance backends to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335761 (https://phabricator.wikimedia.org/T157022) [16:27:29] (03PS2) 10Filippo Giunchedi: graphite: move alerts to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335763 (https://phabricator.wikimedia.org/T157022) [16:27:31] (03PS2) 10Filippo Giunchedi: diamond: switch to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335764 (https://phabricator.wikimedia.org/T157022) [16:27:33] (03PS2) 10Filippo Giunchedi: graphite: switch graphite alerts to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335765 (https://phabricator.wikimedia.org/T157022) [16:27:36] (03PS1) 10Rush: admin: cmjohnson update ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/335823 [16:27:45] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] graphite: move performance::site to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335762 
(https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [16:28:03] (03PS3) 10Filippo Giunchedi: graphite: move performance::site to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335762 (https://phabricator.wikimedia.org/T157022) [16:28:42] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] graphite: move performance::site to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335762 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [16:29:02] (03PS2) 10Rush: admin: cmjohnson update ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/335823 [16:30:04] (03PS3) 10Rush: admin: cmjohnson update ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/335823 [16:30:44] (03PS1) 10DCausse: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [16:31:23] (03PS4) 10Rush: admin: cmjohnson update ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/335823 [16:34:25] (03PS2) 10Elukey: Update the jmxtrans submule's sha [puppet] - 10https://gerrit.wikimedia.org/r/335822 [16:34:35] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5341/ - no op" [puppet] - 10https://gerrit.wikimedia.org/r/335822 (owner: 10Elukey) [16:36:17] godog: do we need to update graphite.eqiad.wmnet? 
[16:36:48] (03PS2) 10DCausse: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) [16:37:39] chasemp: I think that there was a code review for operations/dns but probably Filippo is double checking everything before flipping [16:38:00] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2996814 (10Johan) [16:38:12] ah yes https://gerrit.wikimedia.org/r/#/c/335766/1/templates/wmnet [16:38:28] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953444 (10Johan) OK, then I won't include it (but thanks for checking if it should be, @Elitre!). [16:38:35] (03CR) 10DCausse: [C: 04-1] "should be deployed only when dependent code is fully deployed to the cluster and after ttm indices have been manually replicated to codfw." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [16:38:48] cool [16:38:52] (03CR) 10Hashar: "Will want to stop Nodepool, merge/deploy that change. Make sure Nodepool is stopped." [puppet] - 10https://gerrit.wikimedia.org/r/335809 (owner: 10Hashar) [16:42:20] (03CR) 10Cmjohnson: [C: 031] admin: cmjohnson update ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/335823 (owner: 10Rush) [16:42:20] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2222359 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2039.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-re... 
[16:42:44] (03PS5) 10Rush: admin: cmjohnson update ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/335823 [16:43:01] (03CR) 10Rush: [V: 032 C: 032] admin: cmjohnson update ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/335823 (owner: 10Rush) [16:43:28] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2996828 (10Elitre) >>! In T155755#2996105, @elukey wrote: >>>! In T155755#2996087, @Elitre wrote: >> If I understand this correctly, you want people to be notified about... [16:44:28] 06Operations, 10Ops-Access-Requests: Update cmjohnson's keys - https://phabricator.wikimedia.org/T157139#2996844 (10chasemp) 05Open>03Resolved a:03chasemp https://gerrit.wikimedia.org/r/#/c/335823/ [16:45:20] (03PS1) 10Eevans: Enable JMX exporter on RESTBase Staging nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) [16:47:15] (03PS2) 10Eevans: Enable JMX exporter on RESTBase Staging nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) [16:47:40] chasemp: the nodepool image renaming, I guess it is not a good idea to do it on a friday [16:47:54] hashar: it'll hold till next week then sure [16:47:56] chasemp: though I am feeling adventurous, it is probably better to do it on monday/tuesday your morning? [16:48:03] it might just work [16:48:04] plan on tuesday [16:48:10] but can also screw up everything :} [16:48:16] (03CR) 10Eevans: [C: 031] Enable JMX exporter on RESTBase Staging nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [16:48:19] I would rather not try to fix up the mess on a friday night hehe [16:48:22] and we won't know for sure until next regen which is overnight?
[16:48:22] k [16:48:30] ohh soryr [16:48:37] so the issue we had was HHVM segfaulting [16:48:53] moritz and I debugged it yesterday, he compiled a new package we tested and have deployed this morning [16:49:00] and CI is all fine. [16:49:26] chasemp: yeah we will at some point, I have the reviews lined up, mark joined me in brussels [16:49:50] (03CR) 10Hashar: "Checked with Chase, we can do that on Tuesday. It is not a great idea to attempt to break CI on a friday evening :-}" [puppet] - 10https://gerrit.wikimedia.org/r/335809 (owner: 10Hashar) [16:51:08] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:51:52] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2996900 (10jcrespo) 34 is done, I think 62 and 70 are pending. [16:55:26] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2996923 (10Deskana) 05Open>03Resolved [16:56:40] (03PS3) 10Filippo Giunchedi: cache: move graphite/performance backends to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335761 (https://phabricator.wikimedia.org/T157022) [16:57:40] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] cache: move graphite/performance backends to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335761 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [17:01:08] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:02:48] (03CR) 10Hashar: [C: 04-1] "Elukey and I sprinted the review/merge of all the submodules linting I made. He took care of bumping the submodules." 
[puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [17:03:32] (03PS1) 10Andrew Bogott: mwopenstackclients: Specify public endpoint for all keystone clients [puppet] - 10https://gerrit.wikimedia.org/r/335828 [17:03:34] (03PS1) 10Andrew Bogott: mwopenstackclients: Use keystoneauath1 for our session [puppet] - 10https://gerrit.wikimedia.org/r/335829 [17:03:48] (03CR) 10Hashar: "This patch is the last one pending before we can merge in the operations/puppet.git patch that adds puppet-lint support as a rake file ( h" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [17:05:34] !log fail over read traffic from graphite1001 to graphite2001 https://gerrit.wikimedia.org/r/335761 - T157022 [17:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:40] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [17:09:16] (03PS1) 10Jdlrobson: Disable RelatedSites on English, French and Italian Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335830 (https://phabricator.wikimedia.org/T128326) [17:09:20] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2996940 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2039.codfw.wmnet'] ``` and were **ALL** successful. [17:10:49] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: Specify public endpoint for all keystone clients [puppet] - 10https://gerrit.wikimedia.org/r/335828 (owner: 10Andrew Bogott) [17:13:13] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2996961 (10MusikAnimal) @MoritzMuehlenhoff Done! Should have clarified in the description, this is a new key pair and is not us... 
[17:16:18] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: Use keystoneauath1 for our session [puppet] - 10https://gerrit.wikimedia.org/r/335829 (owner: 10Andrew Bogott) [17:16:31] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:21:31] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:56] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 3121 bytes in 5.015 second response time [17:27:44] Confirmed, graphite isn't happy ^ [17:27:54] yeah :( working on it [17:27:56] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 3.147 second response time [17:36:26] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:37:26] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [17:48:29] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2997068 (10fgiunchedi) Read traffic has been switched over to graphite2001 now and seems to work. Note that graphite2001 was unable to talk to eventlog1001, the root cause is that... [17:49:44] ostriches: should be better now [17:51:32] worksforme :) [17:53:17] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2997071 (10fgiunchedi) I've also searched from graphite1001's address in router configs and the only place it shows up is `analytics-in4` filter for carbon/statsd traffic. [17:55:36] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:59:52] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2997077 (10fgiunchedi) graphite2001 has been added to cr1/cr2 for `analytics-in4` [18:00:36] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:02:56] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [18:07:01] !log stop carbon-cache on graphite1001 to prevent useless write load [18:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:46] PROBLEM - Graphite Carbon on graphite1001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [18:10:54] godog: I'm online if needed [18:11:01] not for very long though [18:11:58] 06Operations, 10Monitoring: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#2997082 (10fgiunchedi) p:05Low>03Normal a:03Volans Moving to @Volans as per hangout chat :) [18:12:21] volans: ^ [18:12:24] thanks godog :D [18:12:30] you're welcome! [18:12:35] * volans hides [18:12:42] me and mark didn't even read IRC before updating the task [18:13:06] 06Operations, 10Monitoring, 06Operations-Software-Development: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#2997087 (10Volans) [18:20:36] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:02] godog: should we ack/downtime it? 
^^^ [18:23:08] (03PS1) 10Ema: Log etcd connection status [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335844 (https://phabricator.wikimedia.org/T134893) [18:27:16] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2997132 (10jcrespo) Pending hosts: ``` db1036.eqiad.wmnet: cacert db1021.eqiad.wmnet: cacert db1022.eqiad.wmnet: cacert db1015.eqiad.wmnet: cacert db2050.codfw.wmnet: cacert db20... [18:33:26] 06Operations, 07Puppet, 06Operations-Software-Development: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133#2997141 (10Volans) @jcrespo it never happened to me that puppet-merge was so slow, usually completes in few seconds. Do you have any output/evidence... [18:35:36] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:45:40] 06Operations, 10Monitoring: monitor smart wearout indicators in icinga checks - https://phabricator.wikimedia.org/T157159#2997209 (10RobH) [18:50:36] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:36] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:02:39] !log gerrit: flushed all web_sessions, you'll have to login again. Sorry [19:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:36] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:51] ostriches is this due to the user that we were trying to block from gerrit yesterday? [19:07:06] Yes, he had a second account we hadn't blocked. [19:07:06] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:07:12] #nothingtoseehere #movealong [19:07:29] ostriches ok, did you know you could have blocked him through All-Project.
[19:07:49] I don't like editing all-projects. [19:07:51] by setting the view option in refs/* for the user (username) and set deny. [19:07:53] * ostriches set inactive instead [19:07:55] oh [19:23:56] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.251 second response time [19:24:56] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.330 second response time [19:34:46] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:35:06] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:38:39] (03PS1) 10Nuria: Adding uaprser to eventlogging deps [puppet] - 10https://gerrit.wikimedia.org/r/335854 (https://phabricator.wikimedia.org/T153207) [19:43:30] (03PS1) 10Madhuvishy: labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 [19:44:32] (03PS2) 10Madhuvishy: labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 [19:44:40] (03CR) 10jerkins-bot: [V: 04-1] labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (owner: 10Madhuvishy) [19:46:25] (03CR) 10jerkins-bot: [V: 04-1] labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (owner: 10Madhuvishy) [19:46:33] (03CR) 10Mobrovac: "What about restbase-test200x ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [19:47:31] (03PS3) 10Madhuvishy: labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 [19:48:25] (03CR) 10jerkins-bot: [V: 04-1] labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (owner: 10Madhuvishy) [19:53:37] (03PS4) 10Madhuvishy: labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [19:55:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:04] (03PS5) 10Madhuvishy: labstore: Diamond collector to track tools home and project dir sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [20:01:20] 06Operations, 10ops-codfw, 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2997396 (10Papaul) I will need a maintenance window set for this system on Monday from 10am to 1pm for the controller replacement. Thanks Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Ent... [20:02:46] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:03:46] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:05:32] (03PS1) 10Dzahn: lower TTL for webproxy.$dc.wmnet [dns] - 10https://gerrit.wikimedia.org/r/335857 [20:07:43] (03PS2) 10Dzahn: lower TTL for webproxy.$dc.wmnet [dns] - 10https://gerrit.wikimedia.org/r/335857 (https://phabricator.wikimedia.org/T123733) [20:15:00] (03CR) 10Dzahn: [C: 032] lower TTL for webproxy.$dc.wmnet [dns] - 10https://gerrit.wikimedia.org/r/335857 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [20:25:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:54] 06Operations, 10ChangeProp, 06Services (later): Add storage to Change-Prop for deduplication - https://phabricator.wikimedia.org/T157089#2997452 (10GWicke) [20:29:58] (03PS3) 10Dzahn: switch webproxies away from carbon to install1002/2002 [dns] - 10https://gerrit.wikimedia.org/r/335747 (https://phabricator.wikimedia.org/T123733) [20:30:26] 06Operations, 10ChangeProp, 06Services (later): Add storage to Change-Prop for deduplication - https://phabricator.wikimedia.org/T157089#2995197 (10GWicke) I added a requirements section that more explicitly calls out what we are looking for in a storage backend. [20:31:46] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:39:52] 06Operations, 10ChangeProp, 06Services (later): Add storage to Change-Prop for deduplication - https://phabricator.wikimedia.org/T157089#2997494 (10Pchelolo) Interesting solution from Netflix for multi-datacenter replication of Redis: https://github.com/Netflix/dynomite [20:53:57] (03PS2) 10Dzahn: DNS/Decom: Remove mgmt dns productions DNS for db2015,db2025-db2027 Bug:T156342,T149102 [dns] - 10https://gerrit.wikimedia.org/r/335667 (owner: 10Papaul) [21:03:25] (03CR) 10Eevans: [C: 031] "> What about restbase-test200x ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [21:04:40] 06Operations, 06Multimedia, 10Wikimedia-Site-requests, 07Performance: Choose a sensible set of thumbnail sizes for Special:Preferences - https://phabricator.wikimedia.org/T106640#2997545 (10Quiddity) [21:14:06] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [21:15:06] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3237060 keys, up 95 days 12 hours - replication_delay is 44 [21:19:06] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 613 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3238834 keys, up 95 days 12 hours - replication_delay is 613 [21:20:14] (03PS2) 10Andrew Bogott: Horizon: Only display puppet roles that have filtertags in the puppet comments. [puppet] - 10https://gerrit.wikimedia.org/r/335593 (https://phabricator.wikimedia.org/T149589) [21:20:16] (03PS1) 10Andrew Bogott: Add a bunch of filtertags to puppet class comments [puppet] - 10https://gerrit.wikimedia.org/r/335869 (https://phabricator.wikimedia.org/T149589) [21:22:06] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3237035 keys, up 95 days 13 hours - replication_delay is 0 [21:26:49] (03PS5) 10Andrew Bogott: labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk) [21:30:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:44] (03CR) 10Mobrovac: [C: 031] "Sounds sane. 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [21:40:52] (03CR) 10Madhuvishy: [C: 031] labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk) [21:41:43] (03CR) 10Andrew Bogott: [C: 032] labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk) [21:43:13] (03PS1) 10Andrew Bogott: Revert "mwopenstackclients: Use keystoneauath1 for our session" [puppet] - 10https://gerrit.wikimedia.org/r/335898 [21:46:03] (03CR) 10Andrew Bogott: [C: 032] Revert "mwopenstackclients: Use keystoneauath1 for our session" [puppet] - 10https://gerrit.wikimedia.org/r/335898 (owner: 10Andrew Bogott) [21:46:12] (03PS2) 10Andrew Bogott: Revert "mwopenstackclients: Use keystoneauath1 for our session" [puppet] - 10https://gerrit.wikimedia.org/r/335898 [21:50:26] what is happening to gerrit? i got logged out for the second time today [21:50:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:47] (03PS1) 10Andrew Bogott: Labstore: set openstack::version: mitaka on 1004 and 1005 [puppet] - 10https://gerrit.wikimedia.org/r/335917 (https://phabricator.wikimedia.org/T104575) [21:53:53] MatmaRex: 19:02 ostriches: gerrit: flushed all web_sessions, you'll have to login again. 
Sorry [21:54:50] (03CR) 10Dzahn: [C: 032] DNS/Decom: Remove mgmt dns productions DNS for db2015,db2025-db2027 Bug:T156342,T149102 [dns] - 10https://gerrit.wikimedia.org/r/335667 (owner: 10Papaul) [21:55:41] (03CR) 10Madhuvishy: [C: 031] Labstore: set openstack::version: mitaka on 1004 and 1005 [puppet] - 10https://gerrit.wikimedia.org/r/335917 (https://phabricator.wikimedia.org/T104575) (owner: 10Andrew Bogott) [21:55:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:00:09] (03CR) 10Andrew Bogott: [C: 032] Labstore: set openstack::version: mitaka on 1004 and 1005 [puppet] - 10https://gerrit.wikimedia.org/r/335917 (https://phabricator.wikimedia.org/T104575) (owner: 10Andrew Bogott) [22:01:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:03:47] ostriches: did you (or someone) upgrade gerrit today? [22:04:08] Nope, no upgrade [22:04:44] ok [22:04:53] Still running 2.13.4-13-gc0c5cc4742 [22:04:55] I have a patch which I can't submit or rebase. But I'll just rebase locally [22:05:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:05:46] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-novaclient] [22:08:16] (03PS3) 10Andrew Bogott: Revert "mwopenstackclients: Use keystoneauath1 for our session" [puppet] - 10https://gerrit.wikimedia.org/r/335898 [22:08:46] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:09:16] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:10:36] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [22:13:36] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [22:16:36] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [22:20:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:21] (03PS4) 10Dzahn: switch webproxies away from carbon to install1002/2002 [dns] - 10https://gerrit.wikimedia.org/r/335747 (https://phabricator.wikimedia.org/T123733) [22:25:24] (03CR) 10Dzahn: [C: 032] "curl -x http://install2002.wikimedia.org:8080 http://lala.de | head" [dns] - 10https://gerrit.wikimedia.org/r/335747 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [22:27:39] !log switching webproxy.*.wmnet CNAMEs from carbon to new install servers (T123733) - watching squid access logs [22:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:43] T123733: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733 [22:30:32] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#2997741 (10Dzahn) ``` [radon:~] $ for dc in eqiad codfw esams ulsfo; do host webproxy.${dc}.wmnet; done webproxy.eqiad.wmnet is an alias for install1002.wikimedia.org. install1002.wikimedia.org has address 2... 
[22:31:40] (03CR) 10Dzahn: "[radon:~] $ for dc in eqiad codfw esams ulsfo; do host webproxy.${dc}.wmnet; done" [dns] - 10https://gerrit.wikimedia.org/r/335747 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [22:38:16] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:40:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:55:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:06:46] PROBLEM - puppet last run on graphite1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:15:36] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [23:16:11] (03PS7) 10Rush: nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [23:17:15] (03CR) 10jerkins-bot: [V: 04-1] nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [23:19:35] (03CR) 10Dzahn: [C: 031] udp2log: mirror traffic to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335625 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [23:23:51] (03PS6) 10Dzahn: Puppet style: Use one line per include/require [puppet] - 10https://gerrit.wikimedia.org/r/334322 (owner: 10Juniorsys) [23:24:42] (03PS8) 10Rush: nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [23:30:31] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5343/" [puppet] - 10https://gerrit.wikimedia.org/r/334322 (owner: 10Juniorsys)