[00:08:21] (PS1) Yuvipanda: Revert "toollabs: Allow HBA login to all hosts" [puppet] - https://gerrit.wikimedia.org/r/230257
[00:08:26] (PS2) Yuvipanda: Revert "toollabs: Allow HBA login to all hosts" [puppet] - https://gerrit.wikimedia.org/r/230257
[00:08:33] (CR) Yuvipanda: [C: 2 V: 2] Revert "toollabs: Allow HBA login to all hosts" [puppet] - https://gerrit.wikimedia.org/r/230257 (owner: Yuvipanda)
[00:11:26] (PS1) BBlack: bugfix for 15bdf16c: set $cluster for role::cache::maps [puppet] - https://gerrit.wikimedia.org/r/230258
[00:11:30] (PS1) Rush: diamond: nutcracker collector improvements [puppet] - https://gerrit.wikimedia.org/r/230259
[00:11:40] (CR) BBlack: [C: 2 V: 2] bugfix for 15bdf16c: set $cluster for role::cache::maps [puppet] - https://gerrit.wikimedia.org/r/230258 (owner: BBlack)
[00:12:15] (CR) jenkins-bot: [V: -1] diamond: nutcracker collector improvements [puppet] - https://gerrit.wikimedia.org/r/230259 (owner: Rush)
[00:13:02] PROBLEM - puppet last run on mw1002 is CRITICAL puppet fail
[00:14:32] (PS2) Rush: diamond: nutcracker collector improvements [puppet] - https://gerrit.wikimedia.org/r/230259
[00:15:02] RECOVERY - puppet last run on mw1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:15:02] RECOVERY - puppet last run on cp1043 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[00:15:31] RECOVERY - puppet last run on cp1044 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:18:31] (PS3) Rush: diamond: nutcracker collector improvements [puppet] - https://gerrit.wikimedia.org/r/230259
[00:30:32] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1520412 (BBlack) This is now basically working at https://maps.wikimedia.org/static/ . Don't link that anywhere or use it on wikis an...
[00:42:39] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1520431 (chasemp) >>! In T105076#1520412, @BBlack wrote: > This is now basically working at https://maps.wikimedia.org/static/ . Don'...
[00:44:55] gwicke: https://phabricator.wikimedia.org/T107493
[00:45:00] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1520440 (BBlack)
[00:49:36] operations, Patch-For-Review: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1520443 (Dzahn) >>! In T104996#1517436, @akosiaris wrote: > Is it me or this incident documentation is not in https://wikitech.wikimedia.org/wiki/Incident_documentation ? I seem to be unable to find it ......
[00:50:49] operations: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1520452 (Dzahn)
[00:51:50] (CR) Dzahn: [C: -1] "wrong order of things" [puppet] - https://gerrit.wikimedia.org/r/228137 (owner: Dzahn)
[00:52:07] Ops-Access-Requests, operations: stat1002 access to tgr - https://phabricator.wikimedia.org/T108417#1520466 (Tgr) NEW
[00:53:50] Ops-Access-Requests, operations: stat1002 access for tgr - https://phabricator.wikimedia.org/T108417#1520491 (Tgr)
[00:54:07] Ops-Access-Requests, operations: stat1002 access for tgr - https://phabricator.wikimedia.org/T108417#1520495 (Tgr)
[01:05:33] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1419 bytes in 0.099 second response time
[01:06:38] jzerebecki: ^ it's back again :p
[01:07:12] hoo:
[01:07:22] argh
[01:07:24] looking
[01:08:30] jzerebecki: I'm seeing the array_key_exists warning again, btw
[01:17:34] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon
[01:23:20] bblack, you are amazing, thank you!!!!!!!!!!!!!!!!!!!!!!!!!!
[01:32:01] !log ori Synchronized php-1.26wmf17/resources/src/mediawiki.legacy/wikibits.js: I664ba9b0af: Override document.writeln to prevent it from blanking pages (duration: 00m 13s)
[01:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:34:41] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1520639 (Yurik) @bblack, thank you for doing this on such a short notice! The results so far are amazing! We will be doing performance...
[01:37:53] !log Deleted changes 237357747 and 237363245 from wikidata's wb_changes
[01:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:40:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0]
[01:42:02] I start to run out of ideas, tbh
[01:42:46] hoo: re?
[01:42:51] oh
[01:43:17] I know the underlying bug, but fixing it now is not an option
[01:43:30] (a table field is too small, leading to cut-off json, probably leading to this)
[01:47:47] ffs it's still (slowly) increasing
[01:47:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:48:06] seems like dispatching works, but extremely slowly, that makes (close to) no sense
[01:53:02] argh crap
[01:53:07] new invalid entries keep appearing
[01:53:56] !log Deleted change 237365841 as well
[01:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:56:09] Made the bug UBN
[02:12:01] Two new problematic changes appeared by now
[02:13:21] * one
[02:14:05] *sigh*
[02:17:09] operations, Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1520690 (Dzahn) a:Dzahn
[02:17:29] hoo: do you think we should do something drastic like not add changes that hit the limit?
[02:18:21] I hoped that we could avoid that
[02:18:33] but it's not getting better
[02:18:52] although I spawned a huge load of dispatchers at one point
[02:19:15] I wonder why dispatching gets so strange, though
[02:20:14] so it is not better even though there is no invalid entry in wb_changes that waits for dispatch?
[02:20:28] Well, new ones keep appearing
[02:20:37] but even deleting them doesn't seem to help
[02:20:56] the error message goes away, but it doesn't seem to improve (in a timely manner, at least)
[02:23:05] !log l10nupdate Synchronized php-1.26wmf17/cache/l10n: l10nupdate for 1.26wmf17 (duration: 06m 20s)
[02:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:20] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf17) at 2015-08-08 02:26:20+00:00
[02:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:39] hoo: I would assume it only improves if there is no new wb_changes entry with the problem
[02:27:07] Sounds plausible
[02:27:12] two new entries appeared by now
[02:27:18] let's hot "fix" that
[02:29:59] jzerebecki: Do you know how php strlen maps to what will appear in the DB?
[02:30:37] hoo: both should be bytes
[02:31:06] because it is AFAIK not getting reencoded
[02:32:44] I'll use 65500 as a cut off to play it safe
[02:32:57] k
[02:39:25] Will deploy once I have a +2
[02:42:02] hoo: looks good, you need to +2
[02:42:13] oh right, will do
[02:42:34] This feels *so* dirty
[02:44:42] it is. though less than cutting off a serialized json string.
[02:54:28] I still don't get why it would start acting so weird
[02:56:25] and we have another cut off row
[03:00:17] !log hoo Synchronized php-1.26wmf17/extensions/Wikidata/: Hack: Don't write change rows where LENGTH(change_info) > 65500 (duration: 00m 21s)
[03:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:44] hoo: queried for long values: https://phabricator.wikimedia.org/T108130#1520724
[03:05:56] I'm constantly doing that as well
[03:06:02] should we delete some of these to see if dispatch recovers?
[03:06:36] It should already... I'm running more than twice as many dispatchers as we have usually
[03:06:55] Seeing the warning all over the place, though
[03:07:51] but I can see it improving now :)
[03:09:17] what about the cutoff values that are pending dispatch?
[03:10:46] that shouldn't matter
[03:11:20] I skimmed the code and it doesn't care about that... just in case I have backups of them in my home on terbium, though
[03:12:17] no i mean those that you did not yet delete, but are in wb_changes, thus were not skipped by the hack
[03:12:31] Oh, these
[03:13:52] Given we have wikis that are almost up to date, I think those are not going to block it all
[03:14:37] but we thought that is the reason for the slowdown
[03:15:06] Yeah, probably
[03:15:19] so if we also deleted those 6 it should be faster if we are right
[03:15:21] but I guess it will just overcome these, might take some time, but it will
[03:16:56] ok, I'm out
[03:46:13] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1413 bytes in 0.181 second response time
[03:53:57] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1520769 (Yurik)
[03:54:55] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1435863 (Yurik)
[03:54:58] operations, Discovery, Maps, Services, and 2 others: Puppetize Kartotherian & Tilerator for deployment - https://phabricator.wikimedia.org/T105074#1520779 (Yurik)
[04:01:47] Blocked-on-Operations, operations, Discovery, Maps, Discovery-Maps-Sprint: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1520788 (Yurik)
[04:10:43] (PS1) Yuvipanda: quarry: Make the worker nodes use the celery module [puppet] - https://gerrit.wikimedia.org/r/230281
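The hot fix discussed between 02:29 and 03:00 skips change rows whose serialized `change_info` exceeds 65500 bytes, measured in bytes rather than characters. A minimal Python sketch of that kind of guard follows; the deployed change was PHP in the Wikidata extension and is not reproduced here. The function names are hypothetical, and the reading of 65500 as a safety margin under the 65535-byte limit of a MySQL BLOB/TEXT column is an assumption, since the log only says the field is too small.

```python
import json

# Cutoff quoted in the log; the actual column type and limit are not stated there.
MAX_CHANGE_INFO_BYTES = 65500


def fits_in_column(serialized: str, limit: int = MAX_CHANGE_INFO_BYTES) -> bool:
    """Return True if the UTF-8 encoding of the value is at most `limit` bytes."""
    # Compare the byte length, not the character count: multi-byte characters
    # make the stored value longer than len(serialized).
    return len(serialized.encode("utf-8")) <= limit


def rows_to_write(changes):
    """Yield (change_id, serialized) for rows that fit; oversized rows are skipped."""
    for change_id, info in changes:
        serialized = json.dumps(info, ensure_ascii=False)
        if fits_in_column(serialized):
            yield change_id, serialized
```

Skipping an oversized row loses that change, but it avoids storing a truncated JSON string that later fails to parse, which is the trade-off described at 02:44:42.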
[04:10:57] (CR) Yuvipanda: [C: 2 V: 2] quarry: Make the worker nodes use the celery module [puppet] - https://gerrit.wikimedia.org/r/230281 (owner: Yuvipanda)
[04:14:02] Blocked-on-Operations, operations, Discovery, Maps, Discovery-Maps-Sprint: Add Redis to maps cluster - https://phabricator.wikimedia.org/T107813#1520811 (Yurik) From irc, redis role: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/redis/manifests/init.pp How to test i...
[04:20:44] !log issuing nodetool cleanup on restbase1008
[04:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:21:35] (PS1) Yuvipanda: quarry: Fix typo I've been making for time immemorial [puppet] - https://gerrit.wikimedia.org/r/230282
[04:22:01] (CR) Yuvipanda: [C: 2 V: 2] quarry: Fix typo I've been making for time immemorial [puppet] - https://gerrit.wikimedia.org/r/230282 (owner: Yuvipanda)
[05:18:44] PROBLEM - BGP status on cr2-ulsfo is CRITICAL host 198.35.26.193, sessions up: 44, down: 1, shutdown: 0; Peering with AS1273 not established - CW
[05:30:23] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Aug 8 05:30:22 UTC 2015 (duration 30m 21s)
[05:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:58:04] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[06:31:13] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail
[06:32:23] PROBLEM - puppet last run on mw1086 is CRITICAL Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:32:25] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 1 failures
[06:32:34] PROBLEM - puppet last run on lvs1003 is CRITICAL Puppet has 1 failures
[06:32:54] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:32:54] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures
[06:33:05] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:33:05] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:55:53] RECOVERY - puppet last run on mw1086 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:43] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:56:43] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:53] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:53] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:54] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on lvs1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:24] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:24] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:58:43] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:44] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:53] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:05:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1419 bytes in 0.138 second response time
[09:11:23] PROBLEM - Disk space on ms-be2003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb1 is not accessible: Input/output error
[09:12:05] PROBLEM - RAID on ms-be2003 is CRITICAL 1 failed LD(s) (Offline)
[09:17:05] PROBLEM - puppet last run on ms-be2003 is CRITICAL Puppet has 1 failures
[09:19:11] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1521143 (Yurik) Open>Resolved @bblack, re non-SSD -- I think there are a few smaller unused SSDs that might be used to upgrade...
[09:19:36] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1521147 (Yurik)
[10:01:34] RECOVERY - Disk space on ms-be2003 is OK: DISK OK
[11:05:35] (PS1) Faidon Liambotis: Add AAAA for bast2001 [dns] - https://gerrit.wikimedia.org/r/230297
[11:06:01] (CR) Faidon Liambotis: [C: 2] Add AAAA for bast2001 [dns] - https://gerrit.wikimedia.org/r/230297 (owner: Faidon Liambotis)
[11:08:50] (PS1) Hoo man: Set a higher lock-grace-interval for wikidata dispatchers [puppet] - https://gerrit.wikimedia.org/r/230298
[11:09:18] Anyone about to look at that?
[11:09:54] Might help us with avoiding further alerts
[11:14:50] I know nothing about that
[11:15:07] I'd be willing to trust you and merge it, but let's do it when it's not a weekend?
[11:15:17] it might be harder to find someone to revert if it goes bad
[11:15:28] Well, it already kind of went bad
[11:15:40] οη?
[11:15:42] er
[11:15:44] oh? :)
[11:16:10] Well, the check is critical since 8am and has already been problematic this night
[11:16:26] I've been up till almost 6am due to this and am again working on it
[11:16:36] oh I had no idea
[11:17:34] so should I just merge the above then?
[11:17:53] Yeah, I'm not sure it will really help, but it seems saner
[11:18:11] if something can take several minutes, we shouldn't assume the worker is dead if it takes more than 60s
[11:19:01] (CR) Faidon Liambotis: [C: 2] Set a higher lock-grace-interval for wikidata dispatchers [puppet] - https://gerrit.wikimedia.org/r/230298 (owner: Hoo man)
[11:19:27] done
[11:20:13] Thanks a lot
[11:23:43] PROBLEM - puppet last run on mw2150 is CRITICAL puppet fail
[11:30:21] no worries
[11:43:23] operations, Discovery, Maps, Services, and 2 others: Puppetize Tilerator for deployment - https://phabricator.wikimedia.org/T105074#1521295 (Yurik)
[11:45:16] operations, Discovery, Maps, Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1521299 (Yurik) I think this task was done during Wikimania
[11:45:28] operations, Discovery, Maps, Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1521300 (Yurik) Open>Resolved a:Yurik
[11:51:41] (PS3) Faidon Liambotis: Fix cr2-eqiad/cr1-esams GRE's PTR typos [dns] - https://gerrit.wikimedia.org/r/220775
[11:51:44] RECOVERY - puppet last run on mw2150 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:51:50] (CR) Faidon Liambotis: [C: 2] Fix cr2-eqiad/cr1-esams GRE's PTR typos [dns] - https://gerrit.wikimedia.org/r/220775 (owner: Faidon Liambotis)
[11:51:58] (PS3) Faidon Liambotis: Add loopback IPs for cr1-eqord and cr1-eqdfw [dns] - https://gerrit.wikimedia.org/r/220776
[11:52:08] (CR) Faidon Liambotis: [C: 2] Add loopback IPs for cr1-eqord and cr1-eqdfw [dns] - https://gerrit.wikimedia.org/r/220776 (owner: Faidon Liambotis)
[11:59:22] Icinga should tell us that it's fine again any second.
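The reasoning at 11:18:11 is that a lock should only be treated as abandoned once its holder has exceeded a grace interval, and that 60s is too short when a dispatch pass can take several minutes. A minimal Python sketch of that staleness check follows; the function name and the 600-second value are hypothetical, and the log does not show the actual dispatcher code or the new interval.

```python
def lock_is_stale(acquired_at: float, now: float, grace_seconds: float) -> bool:
    """Return True if a lock taken at `acquired_at` has outlived the grace interval."""
    return (now - acquired_at) > grace_seconds


# A worker that has been busy for three minutes:
held_for = 180.0

# With a 60s grace interval it is wrongly considered dead and its lock is taken over.
considered_dead_short = lock_is_stale(0.0, held_for, grace_seconds=60.0)

# With a longer (hypothetical) 600s interval it is left alone to finish.
considered_dead_long = lock_is_stale(0.0, held_for, grace_seconds=600.0)
```

The trade-off is that a worker that really has died holds its lock for longer before another dispatcher can take over.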
[11:59:40] but I doubt that will stay, once I stop kicking off workers by hand
[12:00:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1420 bytes in 0.277 second response time
[12:07:38] (PS4) Faidon Liambotis: Allocate neighbor block for cr2-eqiad<->cr1-eqord [dns] - https://gerrit.wikimedia.org/r/220777
[12:07:40] (PS3) Faidon Liambotis: Repurpose s/cr2-eqiad/cr1-eqord/ to link with codfw [dns] - https://gerrit.wikimedia.org/r/220811
[12:07:42] (PS1) Faidon Liambotis: (WIP) Allocate neighbor block for crN-eqdfw<->cr1-eqdfw [dns] - https://gerrit.wikimedia.org/r/230303
[12:10:52] (PS2) Paladox: Set $wgTitleBlacklistLogHits = true on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/138684 (https://phabricator.wikimedia.org/T68450) (owner: Legoktm)
[12:12:39] (CR) Paladox: "That patch was merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/138684 (https://phabricator.wikimedia.org/T68450) (owner: Legoktm)
[12:13:14] (CR) Glaisher: [C: -1] "This can't happen without the IP issue being resolved. Please see the associated task." [mediawiki-config] - https://gerrit.wikimedia.org/r/138684 (https://phabricator.wikimedia.org/T68450) (owner: Legoktm)
[12:26:54] Ops-Access-Requests, operations, Discovery, Maps, Discovery-Maps-Sprint: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1521371 (Krenair)
[13:03:03] (PS2) Faidon Liambotis: Allocate neighbor blocks for cr1/2-codfw<->cr1-eqdfw [dns] - https://gerrit.wikimedia.org/r/230303
[13:03:05] (PS1) Faidon Liambotis: Allocate neighbor block for cr1-ulsfo<->cr1-eqord [dns] - https://gerrit.wikimedia.org/r/230305
[13:03:15] PROBLEM - puppet last run on mw1055 is CRITICAL Puppet has 1 failures
[13:04:15] (PS2) Faidon Liambotis: Allocate neighbor block for cr1-ulsfo<->cr1-eqord [dns] - https://gerrit.wikimedia.org/r/230305
[13:04:17] (PS4) Faidon Liambotis: Repurpose s/cr2-eqiad/cr1-eqord/ for link with codfw [dns] - https://gerrit.wikimedia.org/r/220811
[13:04:19] (PS3) Faidon Liambotis: Allocate neighbor blocks for cr1/2-codfw<->cr1-eqdfw [dns] - https://gerrit.wikimedia.org/r/230303
[13:06:42] (CR) Faidon Liambotis: [C: 2] Allocate neighbor block for cr2-eqiad<->cr1-eqord [dns] - https://gerrit.wikimedia.org/r/220777 (owner: Faidon Liambotis)
[13:07:03] (CR) Faidon Liambotis: [C: 2] Allocate neighbor block for cr1-ulsfo<->cr1-eqord [dns] - https://gerrit.wikimedia.org/r/230305 (owner: Faidon Liambotis)
[13:07:23] (CR) Faidon Liambotis: [C: 2] Allocate neighbor blocks for cr1/2-codfw<->cr1-eqdfw [dns] - https://gerrit.wikimedia.org/r/230303 (owner: Faidon Liambotis)
[13:14:53] PROBLEM - puppet last run on mw2179 is CRITICAL puppet fail
[13:17:04] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[13:18:45] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1521422 (brion) Not until it works.
[13:28:53] RECOVERY - puppet last run on mw1055 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[13:42:34] RECOVERY - puppet last run on mw2179 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[13:42:53] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:23:56] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1521473 (Ironholds) Resolved>Open Not done; to repeat my questions above: 1. Is this going to be integrated with the existing...
[14:57:05] !log issuing nodetool cleanup on restbase1007
[14:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:57:54] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1427 bytes in 0.170 second response time
[14:58:51] !log issuing nodetool cleanup on restbase1005
[14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:02:21] (PS1) Faidon Liambotis: Add cr1-eqord and cr1-eqdfw to monitoring tools [puppet] - https://gerrit.wikimedia.org/r/230309
[15:10:12] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1521523 (BBlack) >>! In T105076#1521473, @Ironholds wrote: > Not done; to repeat my questions above: > > 1. Is this going to be inte...
[15:24:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[15:34:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[16:17:34] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1521570 (Ironholds) Gotcha; awesome! Long as the pipes get hooked up I'm all happy :). (We need a new task for that, or..?)
[16:18:30] operations, Discovery, Maps, Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1521571 (BBlack) No idea, I'm going to ping @ottomata Monday and talk it out with him first and then see what we need to do.
[16:34:35] operations, Traffic, HTTPS: Samsung GT-S3650 can't connect to Wikipedia - https://phabricator.wikimedia.org/T108298#1521588 (Nemo_bis) I couldn't extract any detail on OS, indeed. The Samsung website only says "proprietary OS" and doesn't provide upgrades. I'm not sure about network provider and firmwa...
[17:57:38] Hi, I just got [0e6ad688] 2015-08-08 17:56:46: Fatale fout van type "MWException" on mediawikiwiki when trying to translate a page
[17:57:56] Please lookup the full stacktrace if you want
[18:00:04] "CAS update failed on user_touched for user ID '871777' (read from slave); the version of the user to be saved is older than the current version."
[18:00:34] uh?
[18:01:38] I guess it's https://phabricator.wikimedia.org/T95839?
[18:19:52] that's an annoying one
[18:22:37] AaronSchulz, ^^^
[18:43:44] PROBLEM - puppet last run on mw2194 is CRITICAL puppet fail
[19:07:44] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=84%)
[19:11:34] RECOVERY - puppet last run on mw2194 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:45:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 27.27% of data above the critical threshold [500.0]
[19:55:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:55:36] operations, ops-eqiad, network: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36 - https://phabricator.wikimedia.org/T107635#1521744 (Jdforrester-WMF)
[20:23:36] bd808: I'm going to start doing puppet swats twice a week (monday and wednesday). just me / myself, etc.
[20:23:42] but I'm on vacation next week
[20:23:48] so will start doing it the week after
[20:24:05] and then hopefully have people join in and what not
[20:26:21] operations, ops-eqiad, Traffic: decom/reclaim cp104[34] - https://phabricator.wikimedia.org/T108281#1521755 (Cmjohnson) okay, will there be any name changes? Create a task to update labels and racktables if there will be. Thanks
[20:33:53] YuviPanda: what's a puppet swat?
[20:34:07] (CR) Smalyshev: [C: 1] Isolate wikidata.org cookies and CORS policies [mediawiki-config] - https://gerrit.wikimedia.org/r/230247 (https://phabricator.wikimedia.org/T108101) (owner: Legoktm)
[20:34:35] bblack: it's like swat for puppet! People put up patches they want reviewed / merged beforehand, and I look through them and +2 them if appropriate, with the requestor present.
[20:34:49] helps prevent decay in betacluster and probably other things
[20:35:00] operations, ops-eqiad, Traffic: decom/reclaim cp104[34] - https://phabricator.wikimedia.org/T108281#1521781 (BBlack) No name changes. They're still cp104[34], and now they're being used as the cache_maps cluster (until we come up with better hosts for that at some future date).
[20:35:36] bblack: dunno yet how it'll work, so doing it as an experiment offering myself
[20:36:20] ok
[20:37:09] I mean, is the intent that we'd want most puppet patches in general merging through puppetswat windows? or is it primarily targeted at ops' own patches, or others' patches, or ?
[20:37:51] should focus on non-ops' patches imo
[20:37:52] I don't think it's necessarily a bad idea. We'd still have the option of bypassing for time-critical things.
[20:37:52] bblack: primarily at others' patches
[20:38:06] bblack: opsen usually merge their own patches.
[20:38:16] yeah but that's not necessarily ideal either :)
[20:38:32] bblack: indeed, but that's a very large conversation I don't have a hope of starting :D
[20:38:50] I think you just did! :)
[20:38:53] bblack: this is mostly to encourage more people to write puppet and provide changes to ops/puppet outside of the ops team
[20:39:07] bblack: hah! watch me evade having any meaningful conversation on that topic! :D
[20:39:39] or at least, it should be independent of puppet swat
[20:39:46] which is addressing a very different problem
[20:40:13] yeah, at least initially
[20:40:41] am writing out https://etherpad.wikimedia.org/p/puppetswat
[20:41:22] I do a lot of unreviewed self-merged things. I still cringe about it, but I do it anyways. I don't think we have much of a process in place. It's ping people for review and hope someone takes the time to review. But what if nobody does, or it takes a week or two for someone to finally look? I have Things to get Done! :)
[20:42:16] bblack: I think the other problem is also one of knowledge. I've no hope of reviewing any of the varnish patches... :)
[20:42:21] It wouldn't necessarily be horrible to have some pre-defined windows for SWATing complex changes from within ops, too, with the goal of getting reviewers to show up and hit that list before/during and actually other-review them for merge on a predictable schedule.
[20:43:24] so that's my view of puppetswat-as-reviewparty I guess
[20:43:50] do ops actively watch operations/puppet for things which they could merge?
[20:43:58] some do
[20:44:01] bblack: right.
[20:44:01] I don't very often, no.
[20:44:33] I do watch my own list of things people have tagged me for review on, but it's a mess of old/problematic patches piled up there, too.
[20:44:37] yup
[20:44:44] Krenair: bd808 bblack https://etherpad.wikimedia.org/p/puppetswat
[20:44:50] I am mostly done there, I think
[20:44:51] valhallasw`cloud: ^
[20:45:40] YuviPanda: yesss
[20:46:23] YuviPanda: for tool labs specifically, we could actually make this a once-weekly thing for breaking patches as 'maintenance windows'
[20:46:41] I just hope (in all of the various possible cases) that puppetswat doesn't turn into rushed merges without good review. hopefully at least an informal process will build around those SWAT windows of submitters pushing more aggressively to get relevant deep +1 reviews before the window starts, so that they're ready in time.
[20:46:44] valhallasw`cloud: oooh, yes. we could. do you think we should roll that into the puppetswat windows to begin with?
[20:47:01] bblack: +1. I think review guidelines will come up afterwards based off how the first few go
[20:47:12] bblack: I did an informal one a few days ago, merged a lot of bd808's patches.
[20:47:35] adding a section on what kind of patches will go through swat
[20:47:57] YuviPanda: the other side of that coin is that no patches should be merged outside that window that could affect stability
[20:48:15] YuviPanda: which might not be completely realistic at the moment
[20:48:16] valhallasw`cloud: yeah, that's not going to work I think.
[20:48:18] yea
[20:48:18] h
[20:48:23] ideally, at least one real +1 before the window. then we know it's been vetted to some degree by someone.
[20:48:35] but who knows, let's see how it evolves
[20:49:01] arguably there will be a lot of patches whose reviews are trivial, and lumping them up in a window to get a quick and easy direct +2->merge isn't bad either.
[20:51:07] bblack: +1
[20:52:44] PROBLEM - puppet last run on cp4017 is CRITICAL puppet fail
[20:56:22] bblack: bd808 Krenair I emailed the ops@ list
[20:56:33] do sign at the bottom of the etherpad if you think this is a good idea :)
[21:13:21] (PS3) Alex Monk: Add all groups to bast1001, empty bastiononly group [puppet] - https://gerrit.wikimedia.org/r/227327
[21:13:28] (CR) jenkins-bot: [V: -1] Add all groups to bast1001, empty bastiononly group [puppet] - https://gerrit.wikimedia.org/r/227327 (owner: Alex Monk)
[21:17:34] (PS4) Alex Monk: Add all groups to bast1001, empty bastiononly group [puppet] - https://gerrit.wikimedia.org/r/227327
[21:18:43] RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[22:27:30] Ops-Access-Requests, operations: stat1002 access for tgr - https://phabricator.wikimedia.org/T108417#1521910 (bd808) manager approval: +1