[12:08:19] (03CR) 10BBlack: [C: 031] mailman: move exim outbound ip config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/232080 (owner: 10John F. Lewis) [12:15:58] 6operations, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1552590 (10BBlack) So, our legit requests for recent versions that we expect all have an obvious Referer header of ours, as in:... [12:40:55] (03PS4) 10BBlack: bits-legacy: remove beacon/statsv support [puppet] - 10https://gerrit.wikimedia.org/r/231777 (https://phabricator.wikimedia.org/T95448) [12:44:16] (03CR) 10BBlack: [C: 032] bits-legacy: remove beacon/statsv support [puppet] - 10https://gerrit.wikimedia.org/r/231777 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [12:45:13] (03PS4) 10BBlack: bits-legacy: remove special https://bits redirects for secure wikis [puppet] - 10https://gerrit.wikimedia.org/r/231778 (https://phabricator.wikimedia.org/T95448) [12:45:25] (03CR) 10BBlack: [C: 032 V: 032] bits-legacy: remove special https://bits redirects for secure wikis [puppet] - 10https://gerrit.wikimedia.org/r/231778 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [12:46:06] (03PS3) 10BBlack: re-order range / purge to align text+mobile [puppet] - 10https://gerrit.wikimedia.org/r/232027 (https://phabricator.wikimedia.org/T109286) [12:46:37] (03CR) 10BBlack: [C: 032 V: 032] re-order range / purge to align text+mobile [puppet] - 10https://gerrit.wikimedia.org/r/232027 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [12:48:13] (03PS3) 10BBlack: align text+mobile on filter_(headers|noise) in shared code [puppet] - 10https://gerrit.wikimedia.org/r/232028 (https://phabricator.wikimedia.org/T109286) [12:48:29] (03CR) 10BBlack: [C: 04-1] "Will make this less-ugly first" [puppet] - 10https://gerrit.wikimedia.org/r/232028 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [12:53:29] 
(03PS4) 10BBlack: align text+mobile on filter_(headers|noise) in shared code [puppet] - 10https://gerrit.wikimedia.org/r/232028 (https://phabricator.wikimedia.org/T109286) [12:54:31] 6operations, 10Wikimedia-Mailing-lists: rename lists mwapi-team.disabled.T97148 and flowfunding.disabled.T97328 ? - https://phabricator.wikimedia.org/T109539#1552661 (10JohnLewis) Presumably the disable followed the old system so I recommend we reverse it and just correctly disable the lists. [12:54:54] (03PS5) 10BBlack: align text+mobile on filter_(headers|noise) in shared code [puppet] - 10https://gerrit.wikimedia.org/r/232028 (https://phabricator.wikimedia.org/T109286) [13:14:26] 6operations, 6Labs, 10wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1552732 (10Krenair) >>! In T107547#1551799, @Dzahn wrote: > Isn't the solution here to add the appropriate firewall holes to resolve T98682 instead of l... [13:14:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I will give a more thorough look at everything in a short while; some 1-km-high considerations:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [13:15:38] (03CR) 10BBlack: [C: 032] align text+mobile on filter_(headers|noise) in shared code [puppet] - 10https://gerrit.wikimedia.org/r/232028 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [13:15:58] (03CR) 10Giuseppe Lavagetto: cassandra: WIP support for multiple instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [13:18:39] (03PS1) 10BBlack: bugfix for ddea86b4 [puppet] - 10https://gerrit.wikimedia.org/r/232490 [13:18:41] (03PS1) 10BBlack: filter_noise: wm2015 campaign is over [puppet] - 10https://gerrit.wikimedia.org/r/232491 [13:18:59] (03CR) 10BBlack: [C: 032 V: 032] bugfix for ddea86b4 [puppet] - 
10https://gerrit.wikimedia.org/r/232490 (owner: 10BBlack) [13:19:14] (03CR) 10BBlack: [C: 032 V: 032] filter_noise: wm2015 campaign is over [puppet] - 10https://gerrit.wikimedia.org/r/232491 (owner: 10BBlack) [13:20:08] (03CR) 10Filippo Giunchedi: [C: 04-1] "trivial s/deploy/deployment/ but LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218829 (https://phabricator.wikimedia.org/T78310) (owner: 10Giuseppe Lavagetto) [13:20:55] PROBLEM - puppet last run on cp3030 is CRITICAL Puppet has 1 failures [13:20:55] PROBLEM - puppet last run on cp3006 is CRITICAL Puppet has 1 failures [13:21:08] ^ already fixed, they'll fix themselves [13:21:55] PROBLEM - puppet last run on cp1058 is CRITICAL Puppet has 1 failures [13:22:05] PROBLEM - puppet last run on cp2003 is CRITICAL Puppet has 1 failures [13:22:14] PROBLEM - puppet last run on cp2010 is CRITICAL Puppet has 1 failures [13:22:25] PROBLEM - puppet last run on cp4019 is CRITICAL Puppet has 1 failures [13:22:26] PROBLEM - puppet last run on cp1071 is CRITICAL Puppet has 1 failures [13:22:30] akosiaris, hi, the pgsql perms collapsed yesterday for some weird reason :( [13:22:44] PROBLEM - puppet last run on cp4015 is CRITICAL Puppet has 1 failures [13:26:17] <_joe_> yurik: akosiaris is going to be back tomorrow [13:26:28] _joe_, thanks! [13:26:44] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:26:44] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:23] (03PS1) 10John F. Lewis: add disable/enable list script to mailman::scripts [puppet] - 10https://gerrit.wikimedia.org/r/232494 [13:28:38] (03PS2) 10John F. 
Lewis: add disable/enable list script to mailman::scripts [puppet] - 10https://gerrit.wikimedia.org/r/232494 [13:29:15] (03PS2) 10Filippo Giunchedi: swift: remove support for Ganglia stats [puppet] - 10https://gerrit.wikimedia.org/r/231232 (owner: 10Faidon Liambotis) [13:29:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: remove support for Ganglia stats [puppet] - 10https://gerrit.wikimedia.org/r/231232 (owner: 10Faidon Liambotis) [13:30:05] RECOVERY - puppet last run on cp2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:35:14] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [13:36:34] stupid puppet [13:38:57] <_joe_> just stupid? [13:40:14] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:40:28] (03PS2) 10Filippo Giunchedi: swift_new: add precise support [puppet] - 10https://gerrit.wikimedia.org/r/231233 (owner: 10Faidon Liambotis) [13:40:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift_new: add precise support [puppet] - 10https://gerrit.wikimedia.org/r/231233 (owner: 10Faidon Liambotis) [13:44:35] (03PS2) 10Filippo Giunchedi: swift: reduce the delta with swift_new [puppet] - 10https://gerrit.wikimedia.org/r/231234 (owner: 10Faidon Liambotis) [13:44:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: reduce the delta with swift_new [puppet] - 10https://gerrit.wikimedia.org/r/231234 (owner: 10Faidon Liambotis) [13:47:03] _joe_: well, that and a 38 page unpublished manifesto [13:47:12] <_joe_> ahahah [13:47:24] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [13:47:26] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:47:54] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:48:06] RECOVERY - puppet last run on cp4015 is OK Puppet is 
currently enabled, last run 1 minute ago with 0 failures [13:48:11] The Unipuppeter Manifesto? [13:48:45] mostly in the form of scribbles in heavy black pencil [13:48:55] <_joe_> bblack: you produced a few memorable quotes on the puppet desing, I must have them stashed somewhere [13:49:02] <_joe_> *design [13:49:37] (03PS1) 10Filippo Giunchedi: swift: adjust variable names in label_filesystem [puppet] - 10https://gerrit.wikimedia.org/r/232496 [13:49:46] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:44] PROBLEM - puppet last run on ms-be1001 is CRITICAL puppet fail [13:50:55] PROBLEM - puppet last run on ms-be3002 is CRITICAL puppet fail [13:51:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: adjust variable names in label_filesystem [puppet] - 10https://gerrit.wikimedia.org/r/232496 (owner: 10Filippo Giunchedi) [13:51:09] <_joe_> godog: ^^ anything to worry about? [13:51:21] <_joe_> I guess you're already fixing :) [13:51:26] hehe indeed, just merged [13:52:34] PROBLEM - puppet last run on ms-be3001 is CRITICAL puppet fail [13:53:01] almost there... 
[13:53:04] (03PS1) 10Filippo Giunchedi: swift: add parted dependency [puppet] - 10https://gerrit.wikimedia.org/r/232497 [13:53:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: add parted dependency [puppet] - 10https://gerrit.wikimedia.org/r/232497 (owner: 10Filippo Giunchedi) [13:54:08] testing full puppet runs in production is like driving your car by constantly bumping into the left/right guard rails [13:54:34] PROBLEM - puppet last run on ms-be1009 is CRITICAL puppet fail [13:54:35] PROBLEM - puppet last run on ms-be1007 is CRITICAL puppet fail [13:56:15] PROBLEM - puppet last run on ms-be1017 is CRITICAL puppet fail [13:56:34] RECOVERY - puppet last run on ms-be1001 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:58:08] !log stop puppet on ms-be1* while merging swift refactoring [13:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:25] !log stop puppet on ms-fe1* while merging swift refactoring [13:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:45] (03PS1) 10Filippo Giunchedi: swift: define [puppet] - 10https://gerrit.wikimedia.org/r/232500 [14:03:35] (03PS2) 10Filippo Giunchedi: swift: define ${mount_base} [puppet] - 10https://gerrit.wikimedia.org/r/232500 [14:03:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: define ${mount_base} [puppet] - 10https://gerrit.wikimedia.org/r/232500 (owner: 10Filippo Giunchedi) [14:12:47] (03PS1) 10John F. 
Lewis: mailman: don't notify emails being removed by script [puppet] - 10https://gerrit.wikimedia.org/r/232502 [14:18:24] RECOVERY - puppet last run on ms-be3002 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [14:18:24] RECOVERY - puppet last run on ms-be3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:55] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:22:20] (03PS2) 10Filippo Giunchedi: swift_new: add hiera data for eqiad/esams [puppet] - 10https://gerrit.wikimedia.org/r/231235 (owner: 10Faidon Liambotis) [14:27:53] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1552887 (10GWicke) > Any system that verified that hhvm had correctly restarted would have given @ori a green light too. Correct, but a... [14:28:08] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Rename zh-min-nan -> nan - https://phabricator.wikimedia.org/T30442#1552888 (10Liuxinyu970226) [14:28:28] (03CR) 10Dzahn: [C: 032] mailman: don't notify emails being removed by script [puppet] - 10https://gerrit.wikimedia.org/r/232502 (owner: 10John F. Lewis) [14:28:32] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Rename zh-min-nan -> nan - https://phabricator.wikimedia.org/T30442#335159 (10Liuxinyu970226) [14:28:41] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Rename zh-min-nan -> nan - https://phabricator.wikimedia.org/T30442#335159 (10Liuxinyu970226) [14:30:05] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [14:31:17] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Create a package for python-pykafka for ubuntu precise and debian sid - https://phabricator.wikimedia.org/T109567#1552896 (10Ottomata) Ah! 
Forgot about hafnium, and that it was precise. Will package it for precise too. [14:32:25] (03PS3) 10Filippo Giunchedi: swift_new: add hiera data for eqiad/esams [puppet] - 10https://gerrit.wikimedia.org/r/231235 (owner: 10Faidon Liambotis) [14:32:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift_new: add hiera data for eqiad/esams [puppet] - 10https://gerrit.wikimedia.org/r/231235 (owner: 10Faidon Liambotis) [14:34:19] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1552923 (10GWicke) [14:35:55] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.015 second response time [14:44:27] (03CR) 10Dzahn: "thanks for this! testing on fermium" [puppet] - 10https://gerrit.wikimedia.org/r/232494 (owner: 10John F. Lewis) [14:50:35] (03CR) 10Dzahn: "if i run it with an argument that ist not a valid list name, there will be some error including "No such list" but afterwards it will stil" [puppet] - 10https://gerrit.wikimedia.org/r/232494 (owner: 10John F. Lewis) [14:52:28] Ops: I’m going to reboot bastion-restricted-01 in 10 minutes. That’s the bastion that all ops use to access labs instances. [14:59:42] (03CR) 10Dzahn: "also tested re-enabling. works !:)" [puppet] - 10https://gerrit.wikimedia.org/r/232494 (owner: 10John F. Lewis) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150819T1500). [15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:22] !log rebooting labvirt1007 [15:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:47] jouncebot_: yes sir. [15:00:55] Who is SWAT'ng? 
[15:01:06] kart_: I can SWAT [15:01:43] cool. [15:02:45] PROBLEM - puppet last run on stat1003 is CRITICAL Puppet has 1 failures [15:03:05] PROBLEM - Disk space on labvirt1007 is CRITICAL: Connection refused by host [15:03:28] …and just like every day I forgot to schedule downtime in icinga [15:04:42] oh good, zuul didn't pick up my +2 [15:05:42] ah. [15:06:45] guess it's just slow :\ [15:08:03] thcipriani: core+slow zuul. I can finish my dinner :D [15:08:41] whoa, fatalmonitor just blew up with cirrussearch pool error [15:09:31] well, seems to have calmed now, just a momentary spike. [15:10:30] Yeah. Technically that warning is poolcounter doing the right thing. [15:11:00] I think I filed a task about moving that log output [15:14:01] (03PS4) 10Alex Monk: Adjust mediawiki.org RSS whitelist to allow Wikimedia's technology blog feeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118956 (https://phabricator.wikimedia.org/T63888) (owner: 1001tonythomas) [15:14:44] thcipriani, can we add that to the list? ^ [15:15:18] Krenair: sure [15:16:58] (03PS1) 10Filippo Giunchedi: hieradata: fix missing ] in esams swift_new [puppet] - 10https://gerrit.wikimedia.org/r/232512 [15:17:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] hieradata: fix missing ] in esams swift_new [puppet] - 10https://gerrit.wikimedia.org/r/232512 (owner: 10Filippo Giunchedi) [15:18:15] kart_: that's weird, had a failure on the zend test: ResourceLoaderModuleTest::testGetVersionHash [15:19:27] blah. [15:19:32] Nikerabbit: ^ [15:19:49] thcipriani: which one? [15:20:03] Nikerabbit: https://gerrit.wikimedia.org/r/#/c/232485/ [15:20:19] (03PS3) 10John F. Lewis: add disable/enable list script to mailman::scripts [puppet] - 10https://gerrit.wikimedia.org/r/232494 [15:20:36] thcipriani: umm no idea, it went cleanly to master [15:20:44] is the other branch fine? 
[15:20:50] Nikerabbit: kk, I'll try a recheck on it [15:21:45] Krenair: I'm going to get your config change out quickly here while I'm waiting on the recheck [15:21:52] ok [15:22:34] (03CR) 10Dzahn: [C: 031] Adjust mediawiki.org RSS whitelist to allow Wikimedia's technology blog feeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118956 (https://phabricator.wikimedia.org/T63888) (owner: 1001tonythomas) [15:22:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118956 (https://phabricator.wikimedia.org/T63888) (owner: 1001tonythomas) [15:23:24] (03Merged) 10jenkins-bot: Adjust mediawiki.org RSS whitelist to allow Wikimedia's technology blog feeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118956 (https://phabricator.wikimedia.org/T63888) (owner: 1001tonythomas) [15:24:03] (03PS2) 10Andrew Bogott: replace $::instanceproject with $::labsproject [puppet] - 10https://gerrit.wikimedia.org/r/230652 (https://phabricator.wikimedia.org/T93684) [15:24:25] (03PS2) 10Filippo Giunchedi: Switch ms-fe/ms-be esams to swift_new [puppet] - 10https://gerrit.wikimedia.org/r/231236 (owner: 10Faidon Liambotis) [15:24:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Switch ms-fe/ms-be esams to swift_new [puppet] - 10https://gerrit.wikimedia.org/r/231236 (owner: 10Faidon Liambotis) [15:25:25] (03PS4) 10John F. Lewis: add disable/enable list script to mailman::scripts [puppet] - 10https://gerrit.wikimedia.org/r/232494 [15:25:47] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Adjust mediawiki.org RSS whitelist to allow technology blog feeds [[gerrit:118956]] (duration: 00m 13s) [15:25:54] ^ Krenair sync'd! 
[15:27:55] (03CR) 10Andrew Bogott: [C: 032] replace $::instanceproject with $::labsproject [puppet] - 10https://gerrit.wikimedia.org/r/230652 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott) [15:28:22] thanks thcipriani [15:28:41] works as expected [15:29:42] Krenair: awesome, thanks for checking. [15:29:58] godog: "Switch ms-fe/ms-be esams to swift_new” ? [15:31:31] (03PS1) 10BBlack: vcl_cookies: standardize on X-Orig-Cookie [puppet] - 10https://gerrit.wikimedia.org/r/232515 (https://phabricator.wikimedia.org/T109286) [15:31:34] (03PS1) 10BBlack: vcl_cookies: switch mobile to text-common code, mostly [puppet] - 10https://gerrit.wikimedia.org/r/232516 (https://phabricator.wikimedia.org/T109286) [15:31:36] (03PS1) 10BBlack: vcl_cookies: re-arrange mobile recv order a bit to match text [puppet] - 10https://gerrit.wikimedia.org/r/232517 (https://phabricator.wikimedia.org/T109286) [15:31:38] (03PS1) 10BBlack: vcl_cookies: use common pass_auth in mobile [puppet] - 10https://gerrit.wikimedia.org/r/232518 (https://phabricator.wikimedia.org/T109286) [15:32:34] Nikerabbit: kart_ recheck worked fine, pushing to wmf19 wikis (testwiki) in 1 sec. 
[15:32:54] thcipriani: cool [15:34:45] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1553095 (10demon) 5Open>3Resolved [15:35:16] !log thcipriani@tin Synchronized php-1.26wmf19/includes/changetags/ChangeTags.php: SWAT: Avoid full RC table scans in ChangeTags::updateTags() [[gerrit:232485]] (duration: 00m 13s) [15:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:38] ^ Nikerabbit kart_ sync'd for testwiki, check please [15:37:49] thcipriani: I'm monitoring logs to see if our issue goes away and nothing new comes up, I don't have a way to produce the error deterministically [15:38:15] kk, I'll proceed with wmf18 [15:38:36] 6operations, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1553139 (10greg) >>! In T102991#1552590, @BBlack wrote: > I took another 10-minute sample, looking only at 1.26wmfN where N is... [15:42:29] (03PS1) 10Filippo Giunchedi: hieradata: fix missing ] in eqiad swift_new [puppet] - 10https://gerrit.wikimedia.org/r/232521 [15:43:26] (03PS2) 10Filippo Giunchedi: hieradata: fix missing ] in eqiad swift_new [puppet] - 10https://gerrit.wikimedia.org/r/232521 [15:43:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] hieradata: fix missing ] in eqiad swift_new [puppet] - 10https://gerrit.wikimedia.org/r/232521 (owner: 10Filippo Giunchedi) [15:46:55] (03CR) 10Dzahn: "piping the output list_lists -b into grep gets this from the python script, not the bash script:" [puppet] - 10https://gerrit.wikimedia.org/r/232494 (owner: 10John F. 
Lewis) [15:48:30] 6operations, 6Services, 3Mobile-Content-Service, 5Patch-For-Review, 7service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1553248 (10bearND) [15:49:04] (03PS5) 10Dzahn: add disable/enable list script to mailman::scripts [puppet] - 10https://gerrit.wikimedia.org/r/232494 (owner: 10John F. Lewis) [15:49:34] 6operations, 6Services, 3Mobile-Content-Service, 5Patch-For-Review, 7service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1446023 (10bearND) Thank you. FYI, @Mholloway is going to be the backup contact for the times when I'm not available. [15:51:13] !log thcipriani@tin Synchronized php-1.26wmf18/includes/changetags/ChangeTags.php: SWAT: Avoid full RC table scans in ChangeTags::updateTags() [[gerrit:232484]] (duration: 00m 12s) [15:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:24] cmjohnson1: puppet is running on cp1050 [15:51:26] ^ kart_ Nikerabbit sync'd to wmf18 wikis [15:51:35] dunno why, but i just revoked the cert on palladium ,and deleted cert files there [15:51:39] and then redid cert sign [15:51:40] (03CR) 10Dzahn: [C: 032] "replaced the "list_lists -b" part with "find /var/lib/mailman/lists -maxdepth 1" wfm now" [puppet] - 10https://gerrit.wikimedia.org/r/232494 (owner: 10John F. Lewis) [15:51:41] seems ok [15:51:50] (03PS6) 10Dzahn: add disable/enable list script to mailman::scripts [puppet] - 10https://gerrit.wikimedia.org/r/232494 (owner: 10John F. Lewis) [15:53:25] PROBLEM - puppet last run on ms-fe3002 is CRITICAL puppet fail [15:54:21] thcipriani: not spotting any issues [15:54:43] Nikerabbit: kk, I don't see anything in fatalmonitor either FWIW, thanks for checking. 
[15:57:01] !log thcipriani@tin Synchronized php-1.26wmf19/extensions/ContentTranslation/api/ApiContentTranslationPublish.php: SWAT: Temporarily disable notifications for cx [[gerrit:232505]] (duration: 00m 12s) [15:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:07] ^ kart_ check please [15:57:25] PROBLEM - puppet last run on ms-fe3001 is CRITICAL puppet fail [15:57:39] thcipriani: wmf19 doesn't have any wikipedias yet, does it? [15:58:18] yes. So go ahead for wmf18! [15:58:21] just testwiki, OK, moving forward [15:59:24] stat1003 is trying to git pull "geowiki-scripts" [15:59:28] but that fails now [15:59:40] I thought some wikis were on wmf19 already? [15:59:43] Error: /Stage[main]/Geowiki/Git::Clone[geowiki-scripts]/Exec[git_pull_geowiki-scripts] [16:00:34] RECOVERY - Disk space on labstore1002 is OK: DISK OK [16:00:36] test wikis and mediawiki.org only [16:00:37] milimetric: ^^^^ [16:00:40] re geowiki? [16:01:16] 6operations, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1553321 (10BBlack) Honestly, I have no idea, but probably something like that... [16:01:53] (03PS1) 10Andrew Bogott: Exclude yet more kernels that break nova-compute. [puppet] - 10https://gerrit.wikimedia.org/r/232524 [16:02:37] (03CR) 10jenkins-bot: [V: 04-1] Exclude yet more kernels that break nova-compute. 
[puppet] - 10https://gerrit.wikimedia.org/r/232524 (owner: 10Andrew Bogott) [16:02:39] !log thcipriani@tin Synchronized php-1.26wmf18/extensions/ContentTranslation/api/ApiContentTranslationPublish.php: SWAT: Temp disable notifications for cx [[gerrit:232504]] (duration: 00m 13s) [16:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:54] ^ kart_ should be live on wikipedias now [16:02:59] ok testing [16:03:02] thanks [16:03:30] fix works as far as I can see [16:03:44] 6operations, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1553328 (10BBlack) Actually, it can't be as simple as FB caching and serving outdated text to a browser, which then fetches the... [16:03:46] Nikerabbit: cool—thanks! [16:03:50] \0/ [16:03:56] cmjohnson1: what racks were these new nodes in again? [16:04:06] (03PS2) 10Andrew Bogott: Exclude yet more kernels that break nova-compute. [puppet] - 10https://gerrit.wikimedia.org/r/232524 [16:04:22] 1050-51 b3 52+ a3 [16:04:38] 52+ is in a3 [16:04:51] (03CR) 10jenkins-bot: [V: 04-1] Exclude yet more kernels that break nova-compute. 
[puppet] - 10https://gerrit.wikimedia.org/r/232524 (owner: 10Andrew Bogott) [16:05:00] 1053/1057 are being exchanged and will be in sometime next week [16:06:29] (03PS2) 10BBlack: vcl_cookies: use common pass_auth in mobile [puppet] - 10https://gerrit.wikimedia.org/r/232518 (https://phabricator.wikimedia.org/T109286) [16:06:31] (03PS2) 10BBlack: vcl_cookies: re-arrange mobile recv order a bit to match text [puppet] - 10https://gerrit.wikimedia.org/r/232517 (https://phabricator.wikimedia.org/T109286) [16:06:37] (03PS2) 10BBlack: vcl_cookies: switch mobile to text-common code, mostly [puppet] - 10https://gerrit.wikimedia.org/r/232516 (https://phabricator.wikimedia.org/T109286) [16:07:18] (03CR) 10jenkins-bot: [V: 04-1] vcl_cookies: use common pass_auth in mobile [puppet] - 10https://gerrit.wikimedia.org/r/232518 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [16:07:21] (03CR) 10jenkins-bot: [V: 04-1] vcl_cookies: re-arrange mobile recv order a bit to match text [puppet] - 10https://gerrit.wikimedia.org/r/232517 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [16:07:22] (03PS3) 10Andrew Bogott: Exclude yet more kernels that break nova-compute. [puppet] - 10https://gerrit.wikimedia.org/r/232524 [16:07:43] (03CR) 10jenkins-bot: [V: 04-1] vcl_cookies: switch mobile to text-common code, mostly [puppet] - 10https://gerrit.wikimedia.org/r/232516 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [16:09:46] 6operations, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1553482 (10BBlack) some details from others' investigations: https://www.webmasterworld.com/search_engine_spiders/4436904.htm [16:10:55] ACKNOWLEDGEMENT - puppet last run on stat1003 is CRITICAL Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T109594 [16:10:55] Exec[git_pull_geowiki-scripts]/returns: You are not currently on a branch. 
[16:15:50] (03PS1) 10Ottomata: Add analytics1050-1052, analytics1054-1056 as Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/232528 [16:16:23] (03PS2) 10Ottomata: Add analytics1050-1052, analytics1054-1056 as Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/232528 [16:17:04] (03CR) 10jenkins-bot: [V: 04-1] Add analytics1050-1052, analytics1054-1056 as Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/232528 (owner: 10Ottomata) [16:18:28] (03PS3) 10Ottomata: Add analytics1050-1052, analytics1054-1056 as Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/232528 [16:19:17] (03CR) 10jenkins-bot: [V: 04-1] Add analytics1050-1052, analytics1054-1056 as Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/232528 (owner: 10Ottomata) [16:20:24] (03PS4) 10Ottomata: Add analytics1050-1052, analytics1054-1056 as Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/232528 [16:22:11] PROBLEM - puppet last run on ms-be1007 is CRITICAL puppet fail [16:22:11] PROBLEM - puppet last run on ms-be1017 is CRITICAL puppet fail [16:22:29] PROBLEM - puppet last run on ms-be1003 is CRITICAL puppet fail [16:22:45] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1553498 (10Dzahn) The title says "from tin, terbium, etc". Could you specify the "etc" part of that? 
[16:22:51] PROBLEM - puppet last run on ms-be1009 is CRITICAL puppet fail [16:22:51] PROBLEM - puppet last run on ms-be1002 is CRITICAL puppet fail [16:23:04] (03CR) 10Ottomata: [C: 032] Add analytics1050-1052, analytics1054-1056 as Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/232528 (owner: 10Ottomata) [16:23:40] PROBLEM - puppet last run on ms-be1004 is CRITICAL puppet fail [16:24:20] RECOVERY - puppet last run on ms-be1003 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:25:35] mutante: thanks, that's my fault. I'm trying to repair that job and I'm messing with the repo [16:25:54] milimetric: ah:) cool! [16:25:54] 6operations, 10ops-codfw: RAID disk failure on db2023 - https://phabricator.wikimedia.org/T108701#1553500 (10jcrespo) 5Open>3Resolved a:3jcrespo ``` Firmware: Online, Spun Up ``` [16:25:56] unfortunately it's in the middle of what looks like a really long run, but I'll fix it as soon as that's done [16:26:26] milimetric: i made https://phabricator.wikimedia.org/T109594 for it [16:26:27] ms-be/ms-fe above should be recovering shortly [16:26:45] !log added analytics105[012456] into hadoop cluster as worker nodes [16:26:50] RECOVERY - puppet last run on ms-be1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:10] RECOVERY - puppet last run on ms-be1007 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:28:31] PROBLEM - puppet last run on analytics1050 is CRITICAL puppet fail [16:28:31] 6operations, 10ops-eqiad, 10Analytics-Cluster, 5Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1553505 (10Cmjohnson) All are racked and @ottomata has already added as worker nodes. Unfortunately, we had 2 more servers that came to us not working. Dell t... 
[16:28:53] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1553507 (10Krenair) So we may want it to be accessible from certain other DB hosts (T89548), maybe snapshot hosts too (for dumps)? I think this would techn... [16:30:39] RECOVERY - puppet last run on analytics1050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:41] (03PS1) 10Dzahn: silver/wikitech: allow mysql connection from tin [puppet] - 10https://gerrit.wikimedia.org/r/232529 (https://phabricator.wikimedia.org/T98682) [16:34:34] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database, 5Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1553518 (10Dzahn) There are 2 steps here: a) allow connection per firewall rules b) grant permission in mysql itself a) needs an... [16:36:50] (03PS2) 10Dzahn: silver/wikitech: allow mysql connection from tin [puppet] - 10https://gerrit.wikimedia.org/r/232529 (https://phabricator.wikimedia.org/T98682) [16:40:08] 6operations, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1553544 (10BBlack) My thoughts here at this point are: 1) We should delete the branches automatically and regularly, with some... [16:40:37] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database, 5Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1553545 (10jcrespo) b) Depends on knowing a), as users in MySQL have an srange, too. I suppose you want the standard production users. 
[16:45:44] (CR) Andrew Bogott: [C: 1] silver/wikitech: allow mysql connection from tin [puppet] - https://gerrit.wikimedia.org/r/232529 (https://phabricator.wikimedia.org/T98682) (owner: Dzahn)
[16:45:49] operations, Labs, wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1553629 (Dzahn) I think with "add firewalling -> notice that it breaks things" it's natural to adjust the firewall.
[16:45:56] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds
[16:46:36] (PS3) Dzahn: silver/wikitech: allow mysql connection from tin [puppet] - https://gerrit.wikimedia.org/r/232529 (https://phabricator.wikimedia.org/T98682)
[16:47:45] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.011 second response time
[16:50:37] RECOVERY - puppet last run on ms-be1004 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[16:51:26] RECOVERY - puppet last run on ms-be1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:52:07] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[16:55:19] operations, Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1553662 (Varnent) Thank you everyone!
[17:13:17] (CR) Andrew Bogott: "Tested on labs. Can't merge this until after I reboot labvirt1008 tomorrow, though, as it's still running a 3.13 kernel." [puppet] - https://gerrit.wikimedia.org/r/232524 (owner: Andrew Bogott)
[17:14:24] operations, Wikimedia-Mailing-lists, Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1553768 (RobH) @Dzahn outlined that quite nicely! In fact, his way is the only proper way to achieve what you are looking for.
As such, what you request would best use... [17:14:30] (03PS4) 10Andrew Bogott: Exclude yet more kernels that break nova-compute. [puppet] - 10https://gerrit.wikimedia.org/r/232524 [17:20:01] 6operations, 10ops-codfw, 7network: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#1553786 (10faidon) a:5Papaul>3faidon [17:20:25] 6operations, 10ops-codfw, 7network: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#1553789 (10RobH) I've chatted with Faidon about this in IRC. He's claimed this and will update with the network details. [17:20:46] 6operations, 6Performance-Team, 6Release-Engineering, 10Traffic, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1553790 (10Nemo_bis) [17:30:35] PROBLEM - HHVM rendering on mw2027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.322 second response time [17:32:35] RECOVERY - HHVM rendering on mw2027 is OK: HTTP OK: HTTP/1.1 200 OK - 63716 bytes in 0.449 second response time [17:32:57] (03PS1) 10BryanDavis: labs: Set Vagrant Puppet environment in mwvagrant wrapper [puppet] - 10https://gerrit.wikimedia.org/r/232532 [17:33:28] (03PS2) 10BBlack: vcl_cookies: standardize on X-Orig-Cookie [puppet] - 10https://gerrit.wikimedia.org/r/232515 (https://phabricator.wikimedia.org/T109286) [17:33:40] (03CR) 10BBlack: [C: 032 V: 032] vcl_cookies: standardize on X-Orig-Cookie [puppet] - 10https://gerrit.wikimedia.org/r/232515 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [17:36:56] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [17:39:58] (03PS3) 10BBlack: vcl_cookies: use common pass_auth in mobile [puppet] - 10https://gerrit.wikimedia.org/r/232518 (https://phabricator.wikimedia.org/T109286) [17:40:00] (03PS3) 10BBlack: vcl_cookies: re-arrange mobile recv order a bit to match text [puppet] - 10https://gerrit.wikimedia.org/r/232517 
(https://phabricator.wikimedia.org/T109286) [17:40:02] (03PS3) 10BBlack: vcl_cookies: switch mobile to text-common code, mostly [puppet] - 10https://gerrit.wikimedia.org/r/232516 (https://phabricator.wikimedia.org/T109286) [17:42:08] (03PS1) 10Ottomata: Rename analytics1022 to kafka1022 [dns] - 10https://gerrit.wikimedia.org/r/232534 (https://phabricator.wikimedia.org/T106581) [17:45:45] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [17:45:56] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [17:49:35] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [17:49:46] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [17:50:36] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 9.207 second response time [17:54:36] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:25] RECOVERY - Host mw2031 is UPING OK - Packet loss = 0%, RTA = 52.08 ms [18:00:04] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150819T1800). 
[18:02:55] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1553989 (10RobH) [18:02:56] 6operations, 5Patch-For-Review: Determine Sam Reed's access rights - https://phabricator.wikimedia.org/T109386#1553988 (10RobH) 5Open>3Resolved [18:05:30] (03PS1) 1020after4: group1 wikis to 1.26wmf19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232538 [18:05:44] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232538 (owner: 1020after4) [18:05:49] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232538 (owner: 1020after4) [18:06:18] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf19 [18:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:38] twentyafterfour: i see some fatal errors on wikidata in the new code [18:11:57] we'll probably have a fix soon or else might want to put wikidata back on wmf18 [18:21:14] 6operations: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1554077 (10Moushira) 3NEW [18:22:43] 6operations, 10Security-Reviews: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1554093 (10Krenair) [18:24:13] 6operations, 10Security-Reviews: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1554101 (10Moushira) [18:25:48] twentyafterfour: i have a patch but will take a little bit of time to get it backported [18:25:51] and reviewed [18:27:04] aude: I can review when you have it ready [18:27:20] ok [18:29:17] https://gerrit.wikimedia.org/r/#/c/232541/ looks good to me [18:29:20] (03PS1) 10Ottomata: Rename analytics1022 -> kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/232542 (https://phabricator.wikimedia.org/T106581) [18:29:49] (03CR) 10Ottomata: [C: 032 V: 032] Rename analytics1022 -> 
kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/232542 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:29:59] (03PS1) 10Jean-Frédéric: Add *.naturalis.nl to wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232543 (https://phabricator.wikimedia.org/T109429) [18:30:40] !log starting reinsatll of analytics1022 -> kafka1022 as jessie [18:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:52] 6operations, 10Wikimedia-Mailing-lists: cleanup mailman archives - introduce apache rewrites - https://phabricator.wikimedia.org/T109609#1554130 (10RobH) 3NEW a:3RobH [18:31:26] JohnFLewis: ^ see that when you have a moment, but dont go doing the work, i'd like to implement =] [18:31:40] already chatted with daniel but i'd like to see if you have any holes to poke in that [18:32:12] twentyafterfour: yeah that's it [18:32:12] basically, on can request something like api.php?action=wbgetentities&ids=Q5&languages=en-gb&languagefallback=&format=json&props=labels [18:32:13] props means you want only labels and then other stuff like descriptions gets filtered out [18:32:13] before we apply language fallback [18:32:13] was able to reproduce the bug and verify this fixes it [18:32:30] 6operations, 10Wikimedia-Mailing-lists: cleanup mailman archives - introduce apache rewrites - https://phabricator.wikimedia.org/T109609#1554147 (10RobH) I'll note we'll have to change the documentation to match this, but @JohnLewis has already undertaken some rewrites (as well as the rest of us contributing.) [18:33:28] robh: ack. It's fine and he gave me a brief overview of what you two discussed :) [18:33:52] (03CR) 10Ottomata: [C: 032] Rename analytics1022 to kafka1022 [dns] - 10https://gerrit.wikimedia.org/r/232534 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:34:47] aude: +2'd [18:34:55] thanks [18:34:58] shall I deploy it? 
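Editor's note: the request described at 18:32 can be assembled programmatically. The following is a minimal sketch, assuming the `https://www.wikidata.org/w/api.php` endpoint that is pasted later in this log; the helper name `wbgetentities_url` is hypothetical and only builds the URL, it does not contact the API.

```python
from urllib.parse import urlencode

# Endpoint as pasted later in the log; assumed to be the one meant here.
API = "https://www.wikidata.org/w/api.php"


def wbgetentities_url(entity_id, language, props):
    """Build (but do not send) a wbgetentities request URL.

    Hypothetical helper for illustration only.
    """
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        # Present but empty, exactly as in the URL from the chat.
        "languagefallback": "",
        "format": "json",
        # Restricts the response to the listed parts, e.g. labels only.
        "props": props,
    }
    return API + "?" + urlencode(params)


url = wbgetentities_url("Q5", "en-gb", "labels")
print(url)
```

This combination (`props=labels` together with `languagefallback`) is the one the chat says triggered the fatal errors before the hotfix.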
[18:35:00] jzerebecki also looked [18:35:07] i need to apply it to our build [18:36:23] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1554173 (10RobH) You cannot just use a service ip from public1-eqiad-c.... Until ganeti vm's have a public subnet for use, this seems blocked. [18:38:10] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL 71.43% of data above the critical threshold [10.0] [18:38:14] robh: i guess we never figured out how to really flush PTR records, eh? [18:38:19] just have to wait? [18:38:30] ah, that' smee, need to schedule downtime for that sevice, aye [18:39:51] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL 71.43% of data above the critical threshold [10.0] ottomata I am reinstalling 1022, so it will be like this for an hourish [18:39:51] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL 71.43% of data above the critical threshold [10.0] ottomata I am reinstalling 1022, so it will be like this for an hourish [18:39:51] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL 50.00% of data above the critical threshold [10.0] ottomata I am reinstalling 1022, so it will be like this for an hourish [18:42:40] (03PS1) 10Ottomata: Rename analytics1022 -> kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/232557 (https://phabricator.wikimedia.org/T106581) [18:43:16] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/232556/ (waiting for jenkins now) [18:43:48] and think we can be patient and wait [18:43:51] ottomata: heh... i guess thats going to come up again eh? 
[18:44:06] yeah, and at least once more after this time :) [18:44:14] renaing 3 of these nodes, this is the second [18:44:18] renaming* [18:45:04] hm, robh, i guess i can change hte PTR record ahead of time next time :) [18:45:13] since that is the only one that blocks this [18:45:14] set them to 5 min [18:45:22] hm, k [18:45:24] i do that for all the serives for migrations [18:45:27] =] [18:45:31] then after its done change it back [18:45:43] (just set it to that well in advance, since you'll need the 1h to expire) [18:45:47] 122 5M IN PTR kafka1022.eqiad.wmnet. [18:45:48] ? [18:45:54] yeah [18:46:01] yep, there are otehr ones in there still i bet ;D [18:46:20] well, i just need the PTR to expire, since i want to reinstall now [18:46:22] the others are new names [18:46:27] so they resolve [18:47:09] (03PS1) 10Ottomata: Temporarily set expire of PTR for kafka1022 to 5 min so I can reinstall asap [dns] - 10https://gerrit.wikimedia.org/r/232559 (https://phabricator.wikimedia.org/T106581) [18:47:23] (03CR) 10Ottomata: [C: 032] Temporarily set expire of PTR for kafka1022 to 5 min so I can reinstall asap [dns] - 10https://gerrit.wikimedia.org/r/232559 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:47:54] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1554233 (10JohnLewis) From the VM creation task and Alex comments there - my understanding is Ganeti is just using the public subnet for eqiad c1 and no specific subnet bey... 
[18:49:11] (03PS1) 10Ottomata: Return expire of kafka1022 PTR to 1H [dns] - 10https://gerrit.wikimedia.org/r/232560 (https://phabricator.wikimedia.org/T106581) [18:51:48] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1554241 (10Dzahn) the comment above refers to T108065#1513988 [18:53:09] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1554245 (10RobH) oh, then yea just allocate one out of that subnet... and setting the proper vlan id's in creation. [18:53:34] twentyafterfour: our branch is updated / merged and the submodule update done [18:55:13] aude: deploying it [18:56:32] thanks [18:59:00] !log twentyafterfour@tin Synchronized php-1.26wmf19: deploy hotfix for Wikidata: https://gerrit.wikimedia.org/r/#/c/232556/ (duration: 02m 39s) [18:59:09] hm, robh even with 5M, old PTR is still returned [18:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:54] ottomata: uhh... when did you change it? [19:00:02] it takes an hour for it to change to 5 minutes you realize...? [19:00:06] it has to expire the old one [19:00:09] https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5&languages=en-gb&languagefallback=&format=json&props=labels [19:00:12] looks good :) [19:00:20] twentyafterfour: thanks for deploying the hotfix [19:00:49] ottomata: sorry, i ahve like ten open tabs and tasks, im not really paying that close attention to your issue [19:00:56] is there a summary of it where each step was tried and the result? [19:01:11] cuz at this point the only way i can offer suggestion is to sit down and attempt to do whatever you are doing [19:01:21] sidelining isnt giving me any kind of indicator of whats going on. 
[19:01:28] * aude off to eat soon (assuming no more such issues) [19:01:34] robh, i mean, it isn't a big deal, it sucks at this moment, but will be fine in 30 mins :/ [19:01:42] for the next one i can just change PTR an hour ahead of time [19:01:52] robh, i did not realize that [19:01:54] well, it soudns like a real issue, but im not willing to just make partial comments when its obvious we arnet tacklign the issue [19:02:03] and that would be the problem then, the 5M won't really help. [19:02:05] :) [19:02:07] changing that back [19:02:13] well, im not sure what the issue is [19:02:19] i have no idea what you are trying to do is my point. [19:02:23] haha [19:02:23] ah [19:02:29] you were tyring to change a hostname [19:02:37] but it didnt change and you had to wait out the hour [19:02:37] goal: rename a node, changing all DNS entries to a new name [19:02:39] so now you want to do it again [19:02:45] and you just change it from 1h to 5m? [19:02:46] i'm doing it to a new node now [19:02:52] You have ot wait an hour for the 1h to change. [19:02:54] (not like 2 days ago) [19:02:57] well, up to one hour. [19:03:00] aye, so. 
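Editor's note: the TTL point made above (lowering 1H to 5M does not help copies that are already cached) can be worked through numerically. This is a minimal sketch with made-up timestamps; the function name and the 600-second offset are assumptions for illustration, not values from the log.

```python
def seconds_until_cache_expiry(cached_at, cached_ttl, now):
    """Seconds a resolver keeps serving a record it fetched at
    `cached_at` with TTL `cached_ttl`, evaluated at time `now`."""
    return max(0, cached_at + cached_ttl - now)


OLD_TTL = 3600  # 1H, the TTL the old PTR record was served with
NEW_TTL = 300   # 5M, only applies to answers fetched after the zone edit

# Hypothetical: a resolver cached the old PTR at t=0, the zone is edited at t=600.
# The cached copy still carries the old 1H TTL, so the new 5M value is irrelevant to it.
wait_if_cached = seconds_until_cache_expiry(cached_at=0, cached_ttl=OLD_TTL, now=600)

# Worst case: the resolver fetched the old record just before the edit.
worst_case = seconds_until_cache_expiry(cached_at=600, cached_ttl=OLD_TTL, now=600)

print(wait_if_cached, worst_case)
```

This is why the advice in the chat is to lower the TTL at least one old-TTL period before the planned rename, or to wipe the cached record on the recursors explicitly.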
[19:03:09] all forward records are fine, because the name is new [19:03:12] but, since IP is the same [19:03:16] (03CR) 10Alex Monk: [C: 032] Add *.naturalis.nl to wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232543 (https://phabricator.wikimedia.org/T109429) (owner: 10Jean-Frédéric) [19:03:24] (03Merged) 10jenkins-bot: Add *.naturalis.nl to wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232543 (https://phabricator.wikimedia.org/T109429) (owner: 10Jean-Frédéric) [19:03:25] PTR doesn't change until the 1H is over [19:03:52] i have already changed the PTR entry in DNS, was hoping I could flush cache somehow, so I could do this reinstall without waiting an hour [19:04:03] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/232543/ (duration: 00m 13s) [19:04:03] it is only sorta time sensitive, because we are reinstalling but keeping all data on this node [19:04:05] you shold be able to do just that though [19:04:09] so when it comes back up, it has to replicate from where it left off [19:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:31] right, so the issue seems to be that rec_control wipe-cache doesn't clear the pTR [19:04:53] ok, lemme take a look at the dns stuff [19:05:40] aude: seems fixed :) [19:05:40] btw, robh, i'm changing 5M back to 1H, cause it doesn't matter now i thikn [19:06:23] (03CR) 10Ottomata: [C: 032] Return expire of kafka1022 PTR to 1H [dns] - 10https://gerrit.wikimedia.org/r/232560 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [19:06:53] ottomata: [19:06:55] robh@acamar:~$ sudo rec_control wipe-cache analytics1022.eqiad.wmnet [19:06:55] wiped 7 records, 1 negative records [19:07:04] looks like you didnt wipe them on all of the recurors [19:07:15] i suggested just eqiad but advised you may have to run on all 5 of them [19:07:16] =] [19:07:22] running on 
all 5 now, pls hold [19:07:58] k, robh, i did that the other day and it didn't help, but today I only did just eqiad [19:08:08] ok, we'll see, dunno [19:08:13] k [19:08:47] bah [19:08:49] i ran on all [19:08:58] still ip hosts to analytics old name [19:09:04] ottomata: yea, fubar. [19:09:22] I'd suggest documenting this into a task and we can ask others who understand our dns implementation better [19:09:30] but my understanding is this should solve it. [19:10:03] ottomata: oh wait [19:10:05] templates/10.in-addr.arpa:122 1H IN PTR analytics1022.eqiad.wmnet. [19:10:13] you didnt remove from there? [19:10:22] thats the reverse entry =] [19:10:27] ? [19:10:30] ??? [19:10:37] or i have out of date git hold [19:10:40] hahaha [19:10:41] yes [19:10:43] phew [19:10:46] ha [19:10:53] man how confused id you just get eh? [19:10:56] sorry about that! [19:11:15] So indeed, there is no more mention of this and we have done the wipe cache [19:11:22] so .... I dunno. [19:11:22] (03PS1) 10Ottomata: Fix syntax error in hadoop net-topology.py.erb (missing comma) [puppet] - 10https://gerrit.wikimedia.org/r/232564 [19:11:35] see suggestion on task and pls cc me on it cuz i'll run into this someday. [19:11:41] (03CR) 10Ottomata: [C: 032 V: 032] Fix syntax error in hadoop net-topology.py.erb (missing comma) [puppet] - 10https://gerrit.wikimedia.org/r/232564 (owner: 10Ottomata) [19:11:55] though most renames also involve row changes and ip changes [19:12:09] suggestion on task? [19:12:59] like who to ask? [19:13:07] huh? [19:13:12] "see suggestion on task " [19:13:14] what task? [19:13:19] I'd suggest documenting this into a task and we can ask others who understand our dns implementation better [19:13:23] (comment i made earlier) [19:13:24] oh ok [19:13:25] make a task. 
[19:13:31] yea, heh [19:13:32] oh sorry [19:13:34] k [19:13:36] no worries [19:13:43] confusing phrasing [19:15:21] (03PS1) 10Alex Monk: Allow importers right to be granted on rswikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232565 (https://phabricator.wikimedia.org/T109613) [19:16:55] (03PS2) 10Alex Monk: Allow importers right to be granted on rswikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232565 (https://phabricator.wikimedia.org/T109613) [19:17:18] (03CR) 10Alex Monk: [C: 032] Allow importers right to be granted on rswikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232565 (https://phabricator.wikimedia.org/T109613) (owner: 10Alex Monk) [19:17:24] (03Merged) 10jenkins-bot: Allow importers right to be granted on rswikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232565 (https://phabricator.wikimedia.org/T109613) (owner: 10Alex Monk) [19:18:04] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/232565/ (duration: 00m 12s) [19:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:21] 6operations: rec_control wipe-cache does not clear PTR records - https://phabricator.wikimedia.org/T109617#1554348 (10Ottomata) 3NEW a:3RobH [19:26:42] wai? [19:26:46] why is it assgined to me? [19:26:48] haha [19:26:54] who should I assign it to? 
[19:26:55] i wanna cc not be assgined [19:26:57] i dont konw the answer [19:27:01] ahha ok [19:27:12] I'd suggest documenting this into a task and we can ask others who understand our dns implementation better [19:27:14] 6operations: rec_control wipe-cache does not clear PTR records - https://phabricator.wikimedia.org/T109617#1554360 (10Ottomata) a:5RobH>3None [19:27:15] others [19:27:33] I think bblack was part of the overhaul/implementation [19:27:47] I dunno if he is the perso to ask but would be a good startin gpoint [19:28:16] I am merely interested since many folks first reaciton is to ask me =] [19:28:24] ha [19:28:29] (since historically ive done a lot of installs and reinstalls) [19:29:29] So the likely easiest answer is start pinging random folks in ops who likely know. the make it someone elses provblem is make it moritzm since he is on clinic duty [19:29:36] the former is nicer ;D [19:30:14] but we already pinged him a few times in here during this chat so ilikely youll have to kick the ticket to him [19:30:17] heh [19:32:04] luckily, this is also really easy to recreate wihtout any kind of actual service fqdn =] [19:34:31] ottomata: can you give an example? [19:34:54] 6operations: Run assert check to verify the existence of certain texts in the footer - https://phabricator.wikimedia.org/T108081#1554398 (10chasemp) I have a basic working version of a script that validates certain HTML exists on designated pages. It's not sophisticated but it should do the job. However, I c... 
[19:34:55] easily: he renamed analytics1022.eqiad.wmnet to kafka1022.eqiad.wmnet [19:35:05] and now you can forward dig and get the ip [19:35:11] but the ip hosts back to the old name [19:35:16] cache expired now though [19:35:19] but we did the rec_control wipe-cache on all the recursors [19:35:21] so PTR is what I need to install [19:35:23] indeed, its now expired [19:35:38] but ya, had to wait an hour [19:35:41] before i could continue [19:36:10] well I mean, an actual paste of the failed command [19:36:20] heh [19:36:25] yea that task is very detail light [19:36:34] i'd put in a timeline of your actual commands and the results [19:36:38] sorry, i am in the middle of upgrade, can try to make better later [19:36:40] it's not our command, it belongs to upstream PowerDNS. You'd think they'd notice if it didn't work at all :P [19:37:16] PTR wiping works fine [19:37:18] so my first guesses are (a) syntax wrong or (b) we have some custom hack breaking it, but I don't think the prod recursors have much config hack going on [19:37:19] I've done it a few times [19:37:44] what did you run? [19:37:45] well, i also only wiped the fwdn [19:37:47] not the ip [19:37:54] well, lol [19:38:02] uhm what? [19:38:08] they're completely separate records, they're independent [19:38:08] hey, its not my task [19:38:10] lemme update task with quick summary [19:38:22] i was sideline supporting ottomata and i adivsed i needed a step by step breakdown already! ;] [19:38:23] if you want rec_control to wipe the PTR, you have to ask it to do that :) [19:38:25] * robh run sawayyy [19:38:25] oh, paravoid is there a special way to wipe PTR [19:38:33] it's not special, it's normal! [19:38:36] hahah [19:38:41] ok, but not documented on DNS page [19:38:42] :) [19:38:45] but yea, we should have triped to wipe_cache the ip i guess =P [19:38:50] # rec_control wipe-cache hostname [19:38:51] on all the DNS resolvers. This will also clear any negative cache records. 
[19:38:56] https://wikitech.wikimedia.org/wiki/DNS#Remove_a_record_from_the_DNS_resolver_caches
[19:38:57] yea, we just used hostnames
[19:39:01] should we have used the IP there?
[19:39:12] not the IP, the reverse DNS record
[19:39:16] you should have used the DNS record that you wanted wiped
[19:39:19] i tried with IP and it said it didn't clear anything
[19:39:20] [@hydrogen:~] $ sudo rec_control wipe-cache 10.64.36.122
[19:39:20] did that
[19:39:20] wiped 0 records, 0 negative records
[19:39:26] OH
[19:39:27] reversed IP
[19:39:28] 10.64.36.122 is not a DNS record
[19:39:28] doh
[19:39:35] as in 122.36.64.10.in-addr.arpa
[19:39:39] got it
[19:39:41] ahhh
[19:39:42] 122.36.64.10.in-addr.arpa. I think
[19:39:51] iirc it needs the final dot too
[19:39:52] this quote is incorrect then 'This will also clear any negative cache records.'?
[19:39:58] well... that makes perfect sense
[19:40:02] no, that quote is fine
[19:40:05] ok
[19:40:15] a negative cache record is when the recursor has cached an NXDOMAIN
[19:40:33] (and other minor related cases)
[19:41:42] k
[19:41:55] https://wikitech.wikimedia.org/w/index.php?title=DNS&type=revision&diff=174268&oldid=169111
[19:41:56] :)
[19:42:33] operations: rec_control wipe-cache does not clear PTR records - https://phabricator.wikimedia.org/T109617#1554426 (Ottomata) Open>Invalid a:Ottomata We just are noobs: https://wikitech.wikimedia.org/w/index.php?title=DNS&type=revision&diff=174268&oldid=169111
[19:42:40] don't forget IPv6 too :)
[19:42:52] ottomata: nice update
[19:42:56] 2.2.1.0.6.3.0.0.4.6.0.0.0.1.0.0.6.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa domain name pointer kafka1022.eqiad.wmnet.
[19:43:50] (CR) BryanDavis: "Tested via cherry-pick on shaved-yak.mediawiki-core-team.eqiad.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/232532 (owner: BryanDavis)
[19:45:24] bd808: What is shaved-yak. ?
[19:45:29] A staging server?
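Editor's note: the reverse names discussed above do not have to be built by hand. The sketch below uses Python's standard `ipaddress` module to derive the PTR owner name for the two addresses mentioned in this log; the helper names are hypothetical, the trailing dot follows the "iirc" remark in the chat, and the command is only assembled as a list, not executed.

```python
import ipaddress


def reverse_record(ip):
    """Return the fully qualified reverse-DNS (PTR) owner name for an address.

    IPv4 octets are reversed under in-addr.arpa; IPv6 addresses are expanded
    and reversed nibble by nibble under ip6.arpa.
    """
    return ipaddress.ip_address(ip).reverse_pointer + "."


def wipe_cache_command(ip):
    """Assemble (without running) the rec_control call for the PTR name.

    Hypothetical helper; it would have to be run on each recursor.
    """
    return ["rec_control", "wipe-cache", reverse_record(ip)]


v4 = reverse_record("10.64.36.122")
v6 = reverse_record("2620:0:861:106:10:64:36:122")

print(v4)
print(v6)
print(" ".join(wipe_cache_command("10.64.36.122")))
```

Passing the bare address to `wipe-cache`, as tried at 19:39:20, matches nothing because `10.64.36.122` is not itself a record name in the cache.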
[19:45:41] my test sever for vagrant stuff [19:45:48] (I got the ip6.arpa name by typing "host 2620:0:861:106:10:64:36:122". otherwise it's a PITA to construct (it's the whole address, backwards, by-nibble) [19:46:08] also a fun host name [19:50:40] RoanKattouw: https://en.wiktionary.org/wiki/yak_shaving#Etymology :) [19:50:59] I know that :) [19:51:07] I just didn't know their team had a staging server [19:51:12] i learned it from the host name:) [19:51:43] I had gilded-yak.mediawiki-core-team.eqiad.wmflabs for a while too [19:52:22] * bd808 watched a lot of Ren & Stimpy in college [19:52:22] https://en.wiktionary.org/wiki/forget,_when_up_to_one%27s_neck_in_alligators,_that_the_mission_is_to_drain_the_swamp [19:56:38] (03PS1) 10Alex Monk: Allow rswikimedia bureaucrats to remove sysops and bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232580 (https://phabricator.wikimedia.org/T109621) [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150819T2000). Please do the needful. [20:00:20] (03CR) 10Alex Monk: [C: 032] Allow rswikimedia bureaucrats to remove sysops and bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232580 (https://phabricator.wikimedia.org/T109621) (owner: 10Alex Monk) [20:00:26] (03Merged) 10jenkins-bot: Allow rswikimedia bureaucrats to remove sysops and bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232580 (https://phabricator.wikimedia.org/T109621) (owner: 10Alex Monk) [20:00:55] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/232580/ (duration: 00m 12s) [20:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:15] deploying parsoid [20:10:29] (03PS1) 10John F. Lewis: mailman: give g+x on scripts [puppet] - 10https://gerrit.wikimedia.org/r/232581 [20:10:39] (03PS2) 10John F. 
Lewis: mailman: give g+x on scripts [puppet] - 10https://gerrit.wikimedia.org/r/232581 [20:11:35] "https://en.wikipedia.org/wiki/Main_Page and https://en.wikibooks.org/wiki/Main_Page copyrights are slightly different and the link sends you to somewhere different" [20:11:45] chasemp, you mean wikibooks and wikipedia? not wikipedia and wikipedia :) [20:13:01] (03CR) 10Dzahn: "what about disable_list.sh" [puppet] - 10https://gerrit.wikimedia.org/r/232581 (owner: 10John F. Lewis) [20:13:37] (03CR) 10John F. Lewis: "Ironically my repo didn't know about the change I made :)" [puppet] - 10https://gerrit.wikimedia.org/r/232581 (owner: 10John F. Lewis) [20:16:19] (03PS2) 10Dzahn: mailman: move exim outbound ip config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/232080 (owner: 10John F. Lewis) [20:16:26] (03PS2) 10Ottomata: Rename analytics1022 -> kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/232557 (https://phabricator.wikimedia.org/T106581) [20:16:42] (03CR) 10Ottomata: [C: 032 V: 032] Rename analytics1022 -> kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/232557 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [20:16:54] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Bhojpuri wikipedia should start with 'bho' instead of 'bh' to avoid confusion with Bihari - https://phabricator.wikimedia.org/T41968#1554527 (10Krenair) [20:17:18] !log deployed parsoid version 8d617c99 [20:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:50] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1554543 (10Krenair) [20:19:19] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 7Tracking: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) (tracking) - https://phabricator.wikimedia.org/T10217#1554546 (10Krenair) [20:19:32] 
6operations, 10Wikimedia-Mailing-lists: move mailman server and service IPs to hiera / make it possible to run multiple instances at once - https://phabricator.wikimedia.org/T109624#1554550 (10Dzahn) 3NEW a:3JohnLewis [20:19:36] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: language code change for Samogitian: "bat-smg" to "sgs" - https://phabricator.wikimedia.org/T27522#1554558 (10Krenair) [20:19:46] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Renaming the Aramaic (arc) Wikipedia to the Syriac (syc) Wikipedia - https://phabricator.wikimedia.org/T28725#1554560 (10Krenair) [20:19:56] (03PS3) 10Dzahn: mailman: move exim outbound ip config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/232080 (https://phabricator.wikimedia.org/T109624) (owner: 10John F. Lewis) [20:20:08] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Rename emlwiki -> eglwiki - https://phabricator.wikimedia.org/T36217#1554570 (10Krenair) [20:22:12] (03PS4) 10Dzahn: mailman: move exim outbound ip config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/232080 (https://phabricator.wikimedia.org/T109624) (owner: 10John F. Lewis) [20:22:50] 6operations, 10Wikimedia-Mailing-lists: move mailman server and service IPs to hiera / make it possible to run multiple instances at once - https://phabricator.wikimedia.org/T109624#1554598 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/232080/ [20:23:52] (03CR) 10Dzahn: [C: 032] "so nice that it could be checked with the compiler again. and we needed this for the migration, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/232080 (https://phabricator.wikimedia.org/T109624) (owner: 10John F. 
Lewis) [20:29:20] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [20:31:31] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1554637 (10Dzahn) [20:31:33] 6operations, 10Wikimedia-Mailing-lists: move mailman server and service IPs to hiera / make it possible to run multiple instances at once - https://phabricator.wikimedia.org/T109624#1554636 (10Dzahn) 5Open>3Resolved [20:34:40] (03PS3) 10Dzahn: mailman: give g+x on scripts [puppet] - 10https://gerrit.wikimedia.org/r/232581 (owner: 10John F. Lewis) [20:35:43] (03PS4) 10Dzahn: mailman: give g+x on scripts [puppet] - 10https://gerrit.wikimedia.org/r/232581 (owner: 10John F. Lewis) [20:35:49] (03CR) 10Dzahn: [C: 032] mailman: give g+x on scripts [puppet] - 10https://gerrit.wikimedia.org/r/232581 (owner: 10John F. Lewis) [20:36:11] (03PS4) 10Dzahn: silver/wikitech: allow mysql connection from tin [puppet] - 10https://gerrit.wikimedia.org/r/232529 (https://phabricator.wikimedia.org/T98682) [20:36:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL 100.00% of data above the critical threshold [5000000.0] [20:36:51] sok^ new broker coming up [20:42:57] !log disabling puppet on restbase1001 to temporarily enable additional GC logging [20:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:09] (03PS5) 10Dzahn: silver/wikitech: allow mysql connection from tin [puppet] - 10https://gerrit.wikimedia.org/r/232529 (https://phabricator.wikimedia.org/T98682) [20:44:28] (03CR) 10Dzahn: [C: 032] silver/wikitech: allow mysql connection from tin [puppet] - 10https://gerrit.wikimedia.org/r/232529 (https://phabricator.wikimedia.org/T98682) (owner: 10Dzahn) [20:46:00] !log restarting Cassandra on restbase1001 to enable -XX:+PrintAdaptiveSizePolicy [20:46:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:30] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1554694 (BBlack) Re the cookies issues in the above commits, what I've found so far just by looking at logs is: 1. Since it's the same MediaWiki, yes, the Vary headers seem to... [20:51:13] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK Less than 1.00% above the threshold [1.0] [20:52:11] Krenair: so you type "sql labswiki" on tin and get an error message, and i type "sql labswiki" on tin and get ... nothing [20:52:21] odd [20:52:47] mutante, I typed it again and got nothing [20:52:58] hmm, this is before the firewall change that i'm going to apply [20:53:09] all i wanted is confirm :p [20:53:18] Should be able to achieve the same thing by "mysql -u wikiadmin -p`wikiadmin_pass` -h silver" [20:53:54] but with "labswiki" at the end for the db name, I guess.. [20:55:06] (PS1) BBlack: vcl_cookies: reduce text-vs-mobile cookie variance [puppet] - https://gerrit.wikimedia.org/r/232638 (https://phabricator.wikimedia.org/T109286) [20:55:42] Krenair: i see it tries to use IPv6 ... [20:56:01] i wonder what could have changed though since you got that error message on the ticket [20:56:10] looking [20:56:17] oh, it works [20:56:22] it does?? [20:56:35] inet_pton(AF_INET6, "2620:0:861:2:208:80:154:136", &sin6_addr), [20:56:40] ) = -1 ETIMEDOUT (Connection timed out) [20:56:46] it tries v6 first until timeout [20:56:49] and then v4 and then it does? [20:57:00] nope, I typed the command wrong [20:57:42] how? [20:57:58] missed the second ` [20:58:22] so the shell prompted for more input, not mysql [20:58:40] eh, but even just "[tin:~] $ mysql -u wikiadmin -h silver.wikimedia.org labswiki [20:58:45] also nothing [20:59:04] can't connect via ipv4 either [20:59:17] something to do with silver being on a public ip perhaps?
[20:59:18] same, just tried IP [20:59:35] I wonder what silver sees as tin's IP when connecting [21:00:04] andrewbogott chasemp: Respected human, time to deploy Labs network upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150819T2100). Please do the needful. [21:00:39] 21:00:23.574059 IP6 tin.eqiad.wmnet.35747 > silver.wikimedia.org.mysql: Flags [S] [21:00:43] ^ tcpdump on silver [21:00:47] it does get a connection [21:00:50] attempt [21:01:11] presumably a log somewhere will tell you whether it was accepted or blocked? [21:01:16] (CR) MaxSem: [C: 1] "Looks harmless from a MobileFrontend POV." [puppet] - https://gerrit.wikimedia.org/r/232516 (https://phabricator.wikimedia.org/T109286) (owner: BBlack) [21:02:06] (PS5) Andrew Bogott: Exclude yet more kernels that break nova-compute. [puppet] - https://gerrit.wikimedia.org/r/232524 [21:02:07] you did actually apply the new firewall rule, right mutante? [21:02:08] (PS1) Andrew Bogott: Move nova-network over to labnet1002 [puppet] - https://gerrit.wikimedia.org/r/232640 [21:02:19] (PS2) Andrew Bogott: Move nova-network over to labnet1002 [puppet] - https://gerrit.wikimedia.org/r/232640 [21:02:21] it's not merged in gerrit [21:02:55] MaxSem: thanks for looking at that. I hate having to inflict VCL on other people. [21:03:19] Krenair: no, but i still wondered why you got "Can't connect to MySQL server" [21:03:52] submits it [21:04:05] (CR) Rush: "are we worried about?"
[puppet] - https://gerrit.wikimedia.org/r/232640 (owner: Andrew Bogott) [21:04:10] !log disabling puppet on labnet1001 and labnet1002 [21:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:18] (PS3) Rush: Move nova-network over to labnet1002 [puppet] - https://gerrit.wikimedia.org/r/232640 (owner: Andrew Bogott) [21:05:09] (CR) Rush: [C: 1] "lets give it a try" [puppet] - https://gerrit.wikimedia.org/r/232640 (owner: Andrew Bogott) [21:05:13] mutante: i wanna livehack some apache redirect tests on sodium [21:05:18] you doing anything on it right now? [21:05:24] (just checking to be nice ;) [21:05:29] robh: no, i only use fermium for mailman stuff [21:05:43] cool [21:07:05] Krenair: now: ACCEPT tcp -- tin.eqiad.wmnet anywhere tcp dpt:mysql [21:07:22] and db1011 like before [21:07:37] ERROR 1045 (28000): Access denied for user 'wikiadmin'@'10.64.0.196' (using password: YES) [21:07:46] so that's expected, because GRANTs [21:07:49] but one step closer [21:08:42] !log disabled puppet on sodium for livehacking tests for T109609 [21:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:09] (CR) Andrew Bogott: [C: 2] Move nova-network over to labnet1002 [puppet] - https://gerrit.wikimedia.org/r/232640 (owner: Andrew Bogott) [21:10:08] (PS2) BBlack: vcl_cookies: reduce text-vs-mobile cookie variance [puppet] - https://gerrit.wikimedia.org/r/232638 (https://phabricator.wikimedia.org/T109286) [21:11:02] well, that's a promising test [21:11:24] operations, Labs, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1554808 (Dzahn) merged the change above. watched on silver. can now do this from tin: [tin:~] $ mysql -u wikiadmin -h silver.wiki... [21:11:27] works just fine [21:11:45] * robh is always a bit surprised when things work as expected.
[21:13:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK Less than 1.00% above the threshold [1000000.0] [21:14:32] me too! [21:14:42] operations, Labs, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1554820 (Dzahn) @jcrespo let's do `'wikiadmin'@'tin.eqiad.wmnet'` and `'wikiuser'@'tin.eqiad.wmnet'` for now? [21:15:07] operations, Wikimedia-Mailing-lists: cleanup mailman archives - introduce apache rewrites - https://phabricator.wikimedia.org/T109609#1554830 (RobH) This will result in some list archive renumbering, which would happen eventually no matter what. When these are all imported in the future, we'll have to ac... [21:17:53] RECOVERY - nova-network process on labnet1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [21:17:59] robh: btw, there is also that labs instance that is just like sodium [21:18:01] robh: quick q; ever heard of labs? :) [21:18:09] meh [21:18:13] yes [21:18:39] JohnFLewis: ask bblack about the gndns issue [21:18:42] seems me and mutante were going to make the same point at the same time. the timing :) [21:18:43] gdnsd [21:18:55] it can't listen on port 53 you said? [21:18:58] ;p [21:19:01] is he here? :p [21:19:04] PROBLEM - puppet last run on labvirt1006 is CRITICAL puppet fail [21:19:04] yes [21:19:21] indeed - I'll pm [21:19:24] You are both correct, which is why I did it without asking.
[21:19:27] =] [21:19:37] YuviPanda would approve [21:20:31] operations, Labs, wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1554837 (Dzahn) there is now a hole in the firewall to allow connections to mysql on silver from tin [21:20:33] !log livehack reverted sodium back to normal, testing done [21:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:23] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [21:28:43] PROBLEM - salt-minion processes on labnet1002 is CRITICAL: Connection refused by host [21:28:51] operations, MediaWiki-extensions-CentralNotice, Traffic, Wikimedia-Fundraising: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#1554860 (BBlack) I'd have to dig into the details of this more to see, but I'm inclined to... [21:29:03] PROBLEM - DPKG on labnet1002 is CRITICAL: Connection refused by host [21:29:03] PROBLEM - nova-api process on labnet1002 is CRITICAL: Connection refused by host [21:29:12] PROBLEM - configured eth on labnet1002 is CRITICAL: Connection refused by host [21:29:23] PROBLEM - Disk space on labnet1002 is CRITICAL: Connection refused by host [21:29:33] PROBLEM - nova-network process on labnet1002 is CRITICAL: Connection refused by host [21:29:43] PROBLEM - RAID on labnet1002 is CRITICAL: Connection refused by host [21:30:14] PROBLEM - puppet last run on labnet1002 is CRITICAL: Connection refused by host [21:30:21] sorry what is this about gdnsd above?
[21:30:23] PROBLEM - dhclient process on labnet1002 is CRITICAL: Connection refused by host [21:31:54] RECOVERY - salt-minion processes on labnet1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:32:22] John uses it, outside WMF, for a mediawiki cluster and had questions [21:32:23] RECOVERY - DPKG on labnet1002 is OK: All packages OK [21:32:24] RECOVERY - nova-api process on labnet1002 is OK: PROCS OK: 37 processes with regex args ^/usr/bin/python /usr/bin/nova-api [21:32:34] RECOVERY - configured eth on labnet1002 is OK - interfaces up [21:32:35] RECOVERY - Disk space on labnet1002 is OK: DISK OK [21:32:40] ah yes, I had a pull req for that too [21:32:44] RECOVERY - RAID on labnet1002 is OK [21:33:04] RECOVERY - dhclient process on labnet1002 is OK: PROCS OK: 0 processes with command name dhclient [21:33:14] RECOVERY - nova-network process on labnet1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [21:33:22] operations, Wikimedia-Mailing-lists: cleanup mailman archives - introduce apache rewrites - https://phabricator.wikimedia.org/T109609#1554887 (RobH) Also, shame on me for livehacking and not using labs. I have been called out (quite correctly. ;) [21:33:43] when I saw traffic here in -ops about gdnsd and "can't listen on port 53", my first thought was "oh god someone just shut off all traffic to our authdns servers with a firewall rule" [21:34:46] bblack: yeah, this probably wasn't the best place to generically go 'gdnsd won't listen on port 53!'
in :) [21:35:25] RECOVERY - puppet last run on labnet1002 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:37:33] RECOVERY - puppet last run on labvirt1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:38:46] bblack: sorry, should have said "non-WMF" right away :p [21:40:02] operations, Discovery, Maps, Traffic: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1554925 (Yurik) [21:41:13] operations, Labs, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1554926 (Krenair) I'm not sure wikiuser is needed [22:05:52] Ops-Access-Requests, operations: add papaul to ops LDAP group - https://phabricator.wikimedia.org/T109640#1555027 (Dzahn) [22:06:08] operations, Labs: bastion-02.bastion.eqiad.wmflabs not restricted_from=(ops) like bastion-01 is - https://phabricator.wikimedia.org/T109641#1555032 (Krenair) NEW a:yuvipanda [22:22:58] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1555130 (Tgr) >>! In T102566#1548300, @Tau wrote: > Can this issue be fixed if I upgrade from 1.23.3 to 1.23.10? Possibly....
[22:30:40] (PS1) Alex Monk: Remove reference to lost wikitech apple touch icon file [mediawiki-config] - https://gerrit.wikimedia.org/r/232651 (https://phabricator.wikimedia.org/T102699) [22:38:30] (CR) Dzahn: [C: 1] "checked in a /srv/mediawiki/docroot/bits nothing wikitech here at all, we'd have to create one first it seems" [mediawiki-config] - https://gerrit.wikimedia.org/r/232651 (https://phabricator.wikimedia.org/T102699) (owner: Alex Monk) [22:45:51] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1555190 (Andrew) We switched things over to labnet1002 today, and it went OK. It should be faster next time! Etherpad of our experience in progress... [22:46:38] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1555192 (BBlack) @Tgr Do we have any immediate plans to fix those anyways, or a sane plan to fix them that would apply to t... [22:49:29] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1555215 (BBlack) Answering my own question: there's mw/core patches attached to both, last activity about a month ago, with... [22:53:27] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: move nova api to labnet1002 - https://phabricator.wikimedia.org/T109653#1555249 (Andrew) NEW a:Andrew [22:59:05] (PS1) Andrew Bogott: Move nova-api to labnet1002. [puppet] - https://gerrit.wikimedia.org/r/232652 (https://phabricator.wikimedia.org/T109653) [23:02:53] I'm here for SWAT. Sorry if I missed the ping. [23:02:59] I can do.
[23:03:02] operations, Traffic, Wikimedia-DNS, Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1555351 (CCogdill_WMF) Open>Resolved a:CCogdill_WMF Confirmed with @EWilfong_WMF that we have what we need. Thanks for... [23:03:08] matt_flaschen ^ [23:03:23] Thanks. [23:04:16] Oh, I have a random thing... [23:04:30] https://gerrit.wikimedia.org/r/232651 [23:04:57] jouncebot, help [23:05:01] it's still alive [23:05:02] jouncebot, next [23:05:03] In 0 hour(s) and 54 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T0000) [23:05:05] eek [23:05:12] a notice? [23:05:27] oh, a channel notice [23:05:27] heh [23:05:48] greg-g: probably a leftover. it used to notice everything [23:06:01] so why didn't it notify us of the deployment window? [23:06:21] notice seems correct to me, that's exactly what it tries to do, notice the channel [23:06:26] Krenair: because it joined the channel at 23:01 [23:06:44] ah [23:06:57] mutante: in response to help? [23:07:17] well, the regular notification which isn't a notice :p [23:07:48] it was until J.ames_F complained [23:07:50] but i think wasn't the discussion that technically notice is correct but we change it anyways for broken clients or something [23:08:10] heh [23:08:18] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1555371 (Tgr) Nothing apart from lack of reviewers, I think. I can review them in the next couple days, but that only makes... [23:10:04] in my client the notice is _way_ more obvious, different color and such. as it should be [23:11:36] (CR) Dzahn: "checked on wikitech-static.
couldn't find an apple-touch icon there either and the former server virt1000 is dead" [mediawiki-config] - https://gerrit.wikimedia.org/r/232651 (https://phabricator.wikimedia.org/T102699) (owner: Alex Monk) [23:14:36] operations, Security-Reviews: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1555387 (Dzahn) Is it about installing LimeSurvey on our servers or about the option to have it hosted on 3rd party servers? There are both "Download" and "Hosting" links. [23:15:15] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1555390 (Tgr) To answer myself, we log them in `api-feature-usage` but I have no idea how to tell curl vs. fopen from that.... [23:16:14] !log rmoen@tin Synchronized php-1.26wmf18/extensions/Flow: Add debugging code to detect and workaround type hint failure (duration: 00m 14s) [23:18:41] (PS3) Dzahn: admin: add group 'mw-log-readers' [puppet] - https://gerrit.wikimedia.org/r/232282 (https://phabricator.wikimedia.org/T108696) [23:20:14] (CR) Dzahn: [C: 2] "going ahead since it only adds the group but no people to the group." [puppet] - https://gerrit.wikimedia.org/r/232282 (https://phabricator.wikimedia.org/T108696) (owner: Dzahn) [23:21:43] !log rmoen@tin Synchronized php-1.26wmf19/extensions/TimedMediaHandler/: Re-disable 2-pass Theora encoding temporarily [23:22:09] Ops-Access-Requests, operations, Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1555432 (Dzahn) merged.
the new group exists on fluorine now, but is still empty [23:23:08] Ops-Access-Requests, operations, Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1555434 (Dzahn) a:ArielGlenn>Dzahn [23:24:32] And then Krenair's config change [23:24:37] ty [23:25:00] (CR) Robmoen: [C: 2] Remove reference to lost wikitech apple touch icon file [mediawiki-config] - https://gerrit.wikimedia.org/r/232651 (https://phabricator.wikimedia.org/T102699) (owner: Alex Monk) [23:25:07] (Merged) jenkins-bot: Remove reference to lost wikitech apple touch icon file [mediawiki-config] - https://gerrit.wikimedia.org/r/232651 (https://phabricator.wikimedia.org/T102699) (owner: Alex Monk) [23:26:19] (PS1) Dzahn: admins: add tjones to mw-log-readers group [puppet] - https://gerrit.wikimedia.org/r/232658 (https://phabricator.wikimedia.org/T108696) [23:27:29] !log rmoen@tin Synchronized wmf-config/InitialiseSettings.php: Remove reference to lost wikitech apple touch icon file (duration: 00m 13s) [23:27:47] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1555476 (brion) [23:27:56] Krenair: done [23:28:05] thanks rmoen [23:28:15] \o/ yay swat [23:28:22] np [23:37:34] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1555533 (Platonides) links.wikimedia.mtk4988.com is a really bad solution. As already pointed by BBlack, it looks like a phishing at...