[00:06:45] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 61 hours old.
[00:10:05] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[00:21:44] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[00:56:19] (PS13) Paladox: Rename $wmincClosedWikis to $wgWmincClosedWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/207909
[01:05:35] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (92458s 90000s)
[01:30:50] (PS1) Springle: repool db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/213772
[01:33:11] (CR) Springle: [C: 2] repool db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/213772 (owner: Springle)
[01:33:16] (Merged) jenkins-bot: repool db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/213772 (owner: Springle)
[01:35:09] !log springle Synchronized wmf-config/db-eqiad.php: repool db1026, warm up (duration: 00m 14s)
[01:35:20] Logged the message, Master
[02:24:12] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 06m 44s)
[02:24:17] Logged the message, Master
[02:29:12] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-26 02:28:08+00:00
[02:29:19] Logged the message, Master
[02:50:00] (CR) Ori.livneh: varnish: add Python library for iterating on log records (1 comment) [puppet] - https://gerrit.wikimedia.org/r/213293 (owner: Ori.livneh)
[02:55:11] !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 09m 31s)
[02:55:16] Logged the message, Master
[03:02:15] !log LocalisationUpdate completed (1.26wmf7) at 2015-05-26 03:01:12+00:00
[03:02:20] Logged the message, Master
[03:48:05] PROBLEM - puppet last run on analytics1025 is CRITICAL Puppet has 1 failures
[03:48:26] PROBLEM - puppet last run on elastic1008 is CRITICAL Puppet has 1 failures
[03:49:15] PROBLEM - puppet last run on mw2076 is CRITICAL Puppet has 1 failures
[03:49:26] PROBLEM - puppet last run on mw1224 is CRITICAL Puppet has 1 failures
[03:50:34] PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 1 failures
[03:55:24] (PS1) Tim Landscheidt: Tools: Do not require redundant package python-celery-with-redis [puppet] - https://gerrit.wikimedia.org/r/213777 (https://phabricator.wikimedia.org/T91874)
[04:04:35] RECOVERY - puppet last run on mw1224 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[04:04:54] RECOVERY - puppet last run on analytics1025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:15] RECOVERY - puppet last run on elastic1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:06:04] RECOVERY - puppet last run on mw2076 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[04:07:24] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:39:34] PROBLEM - puppet last run on eventlog2001 is CRITICAL puppet fail
[04:58:05] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:27:25] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (8758 90000s)
[05:53:54] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue May 26 05:52:50 UTC 2015 (duration 52m 49s)
[05:54:00] Logged the message, Master
[06:29:25] PROBLEM - puppet last run on db2036 is CRITICAL puppet fail
[06:29:55] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 67 hours old.
[06:30:15] PROBLEM - puppet last run on cp1071 is CRITICAL puppet fail
[06:30:24] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:31:15] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[06:31:34] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[06:32:08] ^ganglia service stopped
[06:32:18] not by me
[06:32:45] RECOVERY - puppet last run on db2036 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:33:25] PROBLEM - puppet last run on mw2082 is CRITICAL Puppet has 1 failures
[06:33:54] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:34:24] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:34:25] PROBLEM - puppet last run on wtp2012 is CRITICAL Puppet has 1 failures
[06:34:34] PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 1 failures
[06:34:45] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:35:25] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures
[06:35:55] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 1 failures
[06:37:41] (PS1) Jcrespo: Scheduled maintenance for Parser Cache #1 [mediawiki-config] - https://gerrit.wikimedia.org/r/213780
[06:40:52] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1311842 (demon) Keep in mind also that Gerrit's not going to start replicating a repo again until the objects in it start changing (new commits, et...
[06:45:44] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:45:45] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:05] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on wtp2012 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:35] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:04] RECOVERY - puppet last run on mw2082 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:14] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:47:24] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:48:04] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:15] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:55:05] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 67 hours old.
[06:56:45] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[07:12:43] (CR) Alexandros Kosiaris: varnish: add Python library for iterating on log records (1 comment) [puppet] - https://gerrit.wikimedia.org/r/213293 (owner: Ori.livneh)
[07:20:36] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 68 hours old.
[07:23:55] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[07:32:53] (CR) Muehlenhoff: [C: 2] Update to 3.19.7 [debs/linux] - https://gerrit.wikimedia.org/r/211701 (owner: Muehlenhoff)
[07:33:22] (CR) Muehlenhoff: [V: 2] Update to 3.19.7 [debs/linux] - https://gerrit.wikimedia.org/r/211701 (owner: Muehlenhoff)
[07:33:42] (CR) Muehlenhoff: [C: 2 V: 2] Update to 3.19.8 [debs/linux] - https://gerrit.wikimedia.org/r/211702 (owner: Muehlenhoff)
[07:40:45] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 68 hours old.
[07:44:05] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[08:03:51] (PS1) Jcrespo: Update of parsercache db servers to MariaDB 10 (#1) [puppet] - https://gerrit.wikimedia.org/r/213784
[08:04:36] (CR) jenkins-bot: [V: -1] Update of parsercache db servers to MariaDB 10 (#1) [puppet] - https://gerrit.wikimedia.org/r/213784 (owner: Jcrespo)
[08:07:58] (PS2) Jcrespo: Update of parsercache db servers to MariaDB 10 (#1) [puppet] - https://gerrit.wikimedia.org/r/213784
[08:21:54] (CR) Jcrespo: [C: 2] Scheduled maintenance for Parser Cache #1 [mediawiki-config] - https://gerrit.wikimedia.org/r/213780 (owner: Jcrespo)
[08:22:50] operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1311918 (jcrespo)
[08:24:54] !log jynus Synchronized wmf-config/db-eqiad.php: depool pc1001 (duration: 00m 13s)
[08:24:59] Logged the message, Master
[08:38:21] replication error from s3 -> dbstore1002, trying to fix it
[08:51:44] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 69 hours old.
[08:53:25] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[09:24:05] PROBLEM - Disk space on analytics1015 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 80848 MB (4% inode=99%): /var/lib/hadoop/data/f 76057 MB (4% inode=99%): /var/lib/hadoop/data/h 80425 MB (4% inode=99%): /var/lib/hadoop/data/j 82720 MB (4% inode=99%): /var/lib/hadoop/data/a 74341 MB (4% inode=99%): /var/lib/hadoop/data/k 84560 MB (4% inode=99%): /var/lib/hadoop/data/c 74925 MB (3% inode=99%): /var/lib/hadoop/data/e
[09:29:24] PROBLEM - High load average on labstore1001 is CRITICAL 100.00% of data above the critical threshold [24.0]
[09:30:54] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1312092 (QChris) >>! In T99990#1311842, @demon wrote: > Keep in mind also that Gerrit's not going to start replicating a repo again until the objec...
[09:32:15] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312094 (greg) >>! In T98676#1311391, @Krenair wrote: > @greg: can I please have a deployment window tomorrow after the morning SWAT?...
[09:42:29] operations, ops-eqiad: dataset1001: add new disk array - https://phabricator.wikimedia.org/T99808#1312102 (ArielGlenn) I assume it's here now. Because I'm waiting for the en wp dump run to finish up, or at least get through the current step, can we wait on this til that happens? Probably tomorrow (May 2...
[10:22:05] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[10:36:43] (CR) Yuvipanda: [C: 2] Tools: Do not require redundant package python-celery-with-redis [puppet] - https://gerrit.wikimedia.org/r/213777 (https://phabricator.wikimedia.org/T91874) (owner: Tim Landscheidt)
[10:39:40] operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1312148 (jcrespo) **Current state:** OS/Kernel/Package updated. Rebooted. MariaDB-WMF 10 installed but not started yet and mysql_upgrade not run. Old configuration in place (puppet agent disabled). pc1001 depooled from...
[11:12:43] why is wikiversions.json just one line now?
[11:17:20] (CR) Glaisher: "This causes wikiversions.json to be just one line now. https://gerrit.wikimedia.org/r/#/c/209018/" [mediawiki-config] - https://gerrit.wikimedia.org/r/208711 (https://phabricator.wikimedia.org/T98051) (owner: Ori.livneh)
[11:22:44] (PS7) Alex Monk: cn.wikimedia.org initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: Dereckson)
[11:28:18] (CR) Glaisher: cn.wikimedia.org initial configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: Dereckson)
[11:29:12] Glaisher, can you post that on the task?
[11:30:32] ok
[11:30:35] Glaisher, although I think wgMetaNamespace gets set to 'Wikimedia' by default for affiliate/chapter wikis
[11:31:24] oh right. we've an override in wmf-config
[11:31:40] I was thinking about core's default :)
[11:31:49] but either way, we should still ask about it, right?
[11:33:06] right
[11:34:18] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312216 (Glaisher) Will Wikimedia and Wikimedia_talk be fine for project and project talk namespace? Or should something else be used?
[12:07:24] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 73 hours old.
[12:10:05] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0]
[12:12:25] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[12:13:25] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[12:43:03] (Abandoned) BBlack: Revert director-level retries changes... [puppet] - https://gerrit.wikimedia.org/r/212550 (owner: BBlack)
[12:46:59] operations, Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1312483 (BBlack)
[12:47:01] operations, Traffic, Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1312482 (BBlack) Open>Resolved
[12:47:28] operations, Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1312488 (BBlack)
[12:47:30] operations, Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1312486 (BBlack) Open>Resolved a:BBlack
[12:53:22] (PS1) BBlack: add strace to base package list [puppet] - https://gerrit.wikimedia.org/r/213822
[12:54:57] (PS2) BBlack: cpufrequtils: ensure configure governor is in use [puppet] - https://gerrit.wikimedia.org/r/209049 (owner: Ori.livneh)
[12:57:35] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 73 hours old.
[12:59:57] (CR) Alexandros Kosiaris: "I 've tested the change and it looks fine to me. I am a bit sceptical on the java integration though. How is java called from mathoid ? sh" [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[13:00:27] (CR) BBlack: [C: -1] "This does more than the commit message claims: it's also essentially taking over any API name that even begins with "rest_v1", which for i" [puppet] - https://gerrit.wikimedia.org/r/213573 (https://phabricator.wikimedia.org/T99859) (owner: GWicke)
[13:01:04] (CR) BBlack: [C: 2] cpufrequtils: ensure configure governor is in use [puppet] - https://gerrit.wikimedia.org/r/209049 (owner: Ori.livneh)
[13:01:04] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old.
[13:03:12] operations, Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1312508 (BBlack) a:BBlack>ori
[13:03:22] operations, Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1312509 (BBlack) Open>Resolved
[13:05:00] (PS2) BBlack: add strace to base package list [puppet] - https://gerrit.wikimedia.org/r/213822
[13:05:08] (CR) BBlack: [C: 2 V: 2] add strace to base package list [puppet] - https://gerrit.wikimedia.org/r/213822 (owner: BBlack)
[13:15:34] bblack: you around?
[13:16:42] (CR) GWicke: "@bblack, would Varnish support a regexp like (\/|$)? That way we'd only match /rest_v1 or /rest_v1/." [puppet] - https://gerrit.wikimedia.org/r/213573 (https://phabricator.wikimedia.org/T99859) (owner: GWicke)
[13:20:58] gwicke: kinda, but still not up to full caffienation
[13:21:22] bblack: okay, I'll let you caffeinate properly then ;)
[13:21:50] but yes, I think that's what we've done in these cases before, use something like (/|$)
[13:22:18] well technically it would be ([/?]|$)? checking other examples...
[13:24:16] I should probably test this in labs in any case
[13:24:21] (PS3) Jcrespo: Update of parsercache db servers to MariaDB 10 (#1) [puppet] - https://gerrit.wikimedia.org/r/213784
[13:25:11] gwicke: yeah I think you want ([/?]|$)
[13:25:29] so that /api/rest_v1?foo=bar works too
[13:26:05] that shouldn't work, actually
[13:26:47] I think it's cleaner to require the slash in that case
[13:27:01] so (\/|$)
[13:27:19] it shouldn't work?
[13:27:53] as in, I think it would be confusing to allow both /rest_v1?doc and /rest_v1/?doc
[13:28:12] I guess perhaps you have no intention of using it, but I'd think for the general case of all APIs, it's a part of it
[13:28:22] with just the trailing slash people have more of an expectation that the server will handle that transparently, often by redirecting
[13:28:28] or whatever we're calling that level of indirection. Meta-APIs? :)
[13:29:22] api api api
[13:29:40] with just the right emphasis
[13:29:42] the whole about whether, if foo/ exists, foo should redirect to foo/, is all about traversing FS directories. There's nothing about that implicitly in HTTP URI's themselves.
[13:30:08] yeah, it's just expectations set by typical Apache behavior
[13:30:19] but only if there are real directories involved
[13:30:41] if a whole subspace of the URL-path is handled by some fastcgi or whatever, in the general case even Apache can't know when to do that transform and when not to.
[13:30:54] it's also common in routers
[13:31:11] as in http router modules, not the networking kind
[13:32:20] an HTTP server implementation (apache, URL-router, whatever) might have a rule like "/foo/" -> "fastcgi:XXX", and that might cause it to redirect /foo to /foo/, but it has no information about whether /foo/bar should redirect to /foo/bar/ any deeper than that, is what I'm saying
[13:32:25] so it's just not a general rule.
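Editor's note on the two patterns discussed between 13:16 and 13:27: the difference between (/|$) and ([/?]|$) is only whether a query string may follow the prefix directly. The sketch below is a rough illustration in Python's `re` module rather than Varnish's own regex engine, so treat it as an approximation. The "^/api/rest_v1" anchor and the sample URLs are assumptions made for the example, not the contents of the patch under review.

```python
import re

# Assumed prefix for illustration; the real VCL in the patch may anchor differently.
STRICT = re.compile(r"^/api/rest_v1(/|$)")      # gwicke's proposal: slash or end only
LENIENT = re.compile(r"^/api/rest_v1([/?]|$)")  # bblack's proposal: also allow "?"

SAMPLE_URLS = [
    "/api/rest_v1",
    "/api/rest_v1/",
    "/api/rest_v1/page/html/Foo",
    "/api/rest_v1?foo=bar",
    "/api/rest_v1_other",
]


def compare(urls=SAMPLE_URLS):
    """Return {url: (strict_matches, lenient_matches)} for each sample URL."""
    return {
        url: (STRICT.search(url) is not None, LENIENT.search(url) is not None)
        for url in urls
    }


if __name__ == "__main__":
    for url, (strict, lenient) in compare().items():
        print(f"{url:32} strict={strict!s:5} lenient={lenient}")
```

Both patterns reject a name that merely begins with rest_v1 (the concern raised in the -1 review), and they differ only on the bare query-string form.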
[13:32:42] yeah, agreed
[13:33:20] the more I think about it the less convinced I become that just accepting /resource as equivalent to /resource/ is a good idea
[13:33:49] maybe it would be better to send a temporary redirect to /rest_v1/?doc instead
[13:33:59] well it's a perfectly valid interpretation of assigning the "/api/resource" path subspace to a given API backend at the varnish-routing level
[13:34:18] although we could add that in RB
[13:34:20] we just have to decide whether, in the general case, we want to give them that or tell them they have to use the trailing slash
[13:35:41] but also, I think we talked weeks about this but I've lost track: in the RB case, why are we transforming for /$domain/v1/? Couldn't RB pre-process that internally and use the Host header and the upstream public URL-path?
[13:35:50] s/weeks/weeks ago/
[13:36:15] I see it more as a convenience thing for users manually typing the path, but also think that keeping the door open to claim /rest_v1 for something else (return the spec?) later would be prudent
[13:37:17] I just don't want to get into a pattern, for all future varnish-level service deployments, of having to deploy custom regex-based URL transforms on arbitrary headers and whatnot. It would be nice to have a standard that can be expressed as a common rule for "/api/foo" -> send requests to the service servers for "foo", perhaps with the following standard transform.
[13:37:45] that we can just template and have a list of meta-api-name -> service-server-hostname
[13:38:57] I don't expect the number to be very large
[13:39:05] it'll probably hover around two for a long time
[13:39:55] but yeah, it's doable as a WMF-specific hack in RB
[13:40:04] it's quite ugly though, imho
[13:41:11] doing the full rewrite in Varnish feels cleaner to me, considering that it's only one line
[13:41:13] well the Host header is a logical part of the URL. It would be more-standard to pay attention to the standard Host-header than to ask some frontend layer to transform that into the path-space.
[13:41:41] we use the host header already
[13:42:27] you mean to tell that it's restbase.svc.eqiad.wmnet vs restbase.wikimedia.org entrypoint?
[13:42:49] well I guess they both ultimately flow through that, though.
[13:42:49] IMHO it makes most sense to generally switch to host-based routing once we switch off rest.wikimedia.org
[13:43:08] that'll avoid it being a WMF-specific hack
[13:43:10] host-based as in Host-header-based?
[13:43:16] yes
[13:43:46] right now we only use it to detect whether the request is proxied, for the purpose of emitting the right paths in the documentation
[13:44:00] but not in the router module itself
[13:44:45] what about the /rest_v1/ -> /v1/ part?
[13:45:14] from the POV of not knowing/caring about RB implementation details, it again seems arbitrary and seems like it doesn't belong.
[13:45:16] lets see if that'll still be needed by then ;)
[13:45:29] I guess I should be saying "/api/rest_v1/" -> "/v1/"
[13:46:03] imho that kind of thing is okay to keep in Varnish
[13:46:19] I guess in the general case for RB as a generic reuseable codebase people can insert into any infrastructure, the base path should be a configurable thing so that it can match the outer path namespacing of the site it's deployed within.
[13:46:57] that's assuming that there is only one entry point, through Varnish
[13:47:05] which doesn't hold in production
[13:47:27] gwicke: not really okay if we can help it. We hack things into varnish where necessary to reach the right servers and deal with cache efficiency issues and metrics, but ideally we shouldn't be doing arbitrary unecessary work there.
[13:47:50] this is a case where all the info the service really needs was already present and we're just doing extra work.
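Editor's note on the "common rule" idea raised at 13:37: one way to picture a templated list of meta-api-name -> service-server-hostname is a single lookup table plus one generic prefix rule. The Python sketch below is only an illustration of that shape. The table contents, the choice to strip the "/api/<name>" prefix before forwarding, and the function name are all assumptions; the log does not say what the "standard transform" would be, and the production logic would live in VCL.

```python
import re

# Hypothetical table: meta-API name -> backend service hostname.
# The hostname appears in the discussion; the pairing is illustrative.
SERVICES = {
    "rest_v1": "restbase.svc.eqiad.wmnet",
}

# One generic rule instead of a custom regex per service:
# /api/<name> followed by "/", "?" or end of URL.
_API_RULE = re.compile(r"^/api/([^/?]+)([/?].*)?$")


def route(url):
    """Return (backend_host, backend_url) for a known meta-API, else None."""
    match = _API_RULE.match(url)
    if match is None:
        return None
    name, rest = match.group(1), match.group(2) or "/"
    if name not in SERVICES:
        return None
    if rest.startswith("?"):
        # Assumed normalisation: a bare query string is sent to the root path.
        rest = "/" + rest
    return SERVICES[name], rest
```

For example, `route("/api/rest_v1/page/html/Foo")` yields the restbase host with `/page/html/Foo`, while an unregistered name such as `/api/unknown/x` falls through to `None` and would be handled by whatever the default backend is.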
[13:48:04] (PS1) Ottomata: Increase YARN max-disk-utilization-per-disk-percentage [puppet] - https://gerrit.wikimedia.org/r/213826
[13:48:17] operations, Patch-For-Review, database: On a maintenance window, upgrade db1063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1312548 (Cmjohnson)
[13:48:19] operations, ops-eqiad: ssh connection to some management servers fails, a hard reset may be needed - https://phabricator.wikimedia.org/T99805#1312545 (Cmjohnson) Open>Resolved a:Cmjohnson Reset the iDRAC's for both. Remote access has been restored.
[13:48:25] gwicke: yeah but that's just because we changed deployment plans. Can it support more than one base path in the general case?
[13:49:02] the goal should be that, to the degree possible, the varnish layer is transparent to the logical requests.
[13:49:05] (CR) Ottomata: [C: 2] Increase YARN max-disk-utilization-per-disk-percentage [puppet] - https://gerrit.wikimedia.org/r/213826 (owner: Ottomata)
[13:49:18] we're nowhere near that goal today, but it makes life easier on so many levels the closer we can stick to that.
[13:49:26] not cleanly in the router, but of course we can insert arbitrary code to manipulate the request before routing, or add a mechanism to generally hook up a handler to do so
[13:49:33] the latter would avoid this being a total hack
[13:50:54] btw, back to the api layout question: we should also set up a listing at /api/
[13:51:00] I think the counter-argument for the general case of the RB codebase would be: deploying RB similarly to how WMF does (embedded similarly within outer path scopes, multiple possible entry points, hostname handling, etc) should not require the use of any frontend handler rewriting things. You should be able to do the same stuff with any arbitrary frontend handling requests initially, or none a
[13:51:07] t all.
[13:52:15] realistically we'll have to limit the number of paths we expose in any service
[13:52:29] (CR) Mobrovac: "Good that you asked :)" [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[13:52:44] but supporting an additional path sounds doable
[13:52:56] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312551 (AddisWang) It will be fine.
[13:54:24] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312554 (chasemp) The security check box is only tangentially related to actual task security. We have a large amount of sensitive tasks here actually and unlike Bugzilla it is no...
[13:55:01] re: /api/ do you have an idea what we want to do with that? I assume the content is essentially a small chunk of static HTML for now. We could route requests for /api([/?]|$]/ to a specific backend that knows how to handle that path and has the content.
[13:55:28] (it could be a file deployed similarly to the /static/ stuff for now, even)
[13:57:40] bblack: we could offer HTML and JSON, depending on the accept header
[13:57:54] PROBLEM - Disk space on analytics1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/b 83049 MB (4% inode=99%): /var/lib/hadoop/data/c 84103 MB (4% inode=99%): /var/lib/hadoop/data/d 81395 MB (4% inode=99%): /var/lib/hadoop/data/e 73887 MB (3% inode=99%): /var/lib/hadoop/data/f 74392 MB (3% inode=99%): /var/lib/hadoop/data/g 79532 MB (4% inode=99%): /var/lib/hadoop/data/h 77203 MB (4% inode=99%): /var/lib/hadoop/data/i
[13:59:32] whuuuuut
[13:59:54] PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 84016 MB (4% inode=99%): /var/lib/hadoop/data/c 80249 MB (4% inode=99%): /var/lib/hadoop/data/e 80967 MB (4% inode=99%): /var/lib/hadoop/data/f 82299 MB (4% inode=99%): /var/lib/hadoop/data/h 75090 MB (3% inode=99%): /var/lib/hadoop/data/l 84584 MB (4% inode=99%): /var/lib/hadoop/data/b 85850 MB (4% inode=99%): /var/lib/hadoop/data/k
[14:00:05] oh doh
[14:00:35] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds
[14:00:51] bblack: and yes, +1 for static html/json
[14:01:47] HM.
[14:01:56] that is ok actually. those disks are large
[14:01:58] hm.
[14:02:01] ok.
[14:02:15] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[14:03:42] (CR) Alexandros Kosiaris: "I think I can make it. Thanks for letting me know." [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[14:04:44] (CR) Physikerwelt: "@Alex... that the reason why I voted -1 on one of the previous patch sets that would have enabled PNG generation. Currently PNG generation" [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[14:06:04] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6016.87990838
[14:09:34] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312591 (jeremyb-phone) I don't know anything about the phab API but I wonder if you could use that in the meantime while waiting for a dump sanitization procedure. "Just VE" proba...
[14:10:46] (PS1) Jcrespo: depool db1063 for pending reboot [mediawiki-config] - https://gerrit.wikimedia.org/r/213829 (https://phabricator.wikimedia.org/T99520)
[14:11:50] (CR) Jcrespo: [C: 2] depool db1063 for pending reboot [mediawiki-config] - https://gerrit.wikimedia.org/r/213829 (https://phabricator.wikimedia.org/T99520) (owner: Jcrespo)
[14:14:12] !log jynus Synchronized wmf-config/db-eqiad.php: depool db1063 (duration: 00m 12s)
[14:14:19] Logged the message, Master
[14:16:28] (CR) Alexandros Kosiaris: "Thanks for the info @Physikerwelt. I am happy that we agree. As far as PNG generation goes, I am leaning towards making the PNG generatio" [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[14:21:38] (CR) 20after4: [C: 2] Don't commit interwiki.cdb anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: Reedy)
[14:21:44] (Merged) jenkins-bot: Don't commit interwiki.cdb anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: Reedy)
[14:23:06] (CR) Mobrovac: "Ok for the amend, but I am not sure it's ready for merging. @Physikerwelt , from what I gather SVG generation has changed from the last-de" [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[14:24:02] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312605 (Krenair) Is it really non-trivial to only include things with visibility policy 'Public'?
[14:29:00] (PS12) Alexandros Kosiaris: mathoid to service::node [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[14:30:34] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312614 (ksmith) @jeremyb-phone It is somewhat of a pain to set up an app to actually use the Phab API, and after you do so, the API has some pretty significant limitations on what...
[14:31:44] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[14:34:14] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312617 (mmodell) >>! In T99295#1312605, @Krenair wrote: > Is it really non-trivial to only include things with visibility policy 'Public'? That way you don't need to deal with cus...
[14:35:34] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312629 (chasemp) Being a well defined relational schema some tables like comments related to transactions don't have an enforced security mode outside of their transaction associa...
[14:37:54] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[14:39:43] (CR) Mobrovac: mathoid to service::node (1 comment) [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh)
[14:40:20] godog: do you happen to know if it is possible to adjust thresholds for disk alerts for certain nodes?
[14:42:41] i'm still searching in puppet to find where these disk checks come from, and I can only really find sda and sdb checks
[14:42:44] in base
[14:43:47] ohhh no this is -A, found it.
[14:43:48] hmmm
[14:44:02] naw, can't adjust right now, HMmMmmmm
[14:45:51] (PS1) Jcrespo: Repool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/213835 (https://phabricator.wikimedia.org/T99520)
[14:47:14] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[14:47:36] hey paravoid, yt?
[14:47:44] yup
[14:47:46] what's up?
[14:48:11] https://github.com/wikimedia/operations-puppet/commit/1ed33150637f7b150c3fdc53a60d24612040a28a
[14:48:22] this means that sda and sdb 1-3 aren't checked on any host, right?
[14:49:30] yeah :(
[14:49:34] Hi. I've just added two patches for this morning SWAT. Sorry for the late addition.
[14:49:40] oh /srv though
[14:49:41] ?
[14:49:46] oho ohoh
[14:49:55] sorry, its ok. didn't read that part, just read the sd* stuff
[14:49:59] that shoudl be fine in most places
[14:50:03] /srv/sd[a-b][1-3]
[14:50:06] yeah ok
[14:50:20] that is pretty specific then
[14:50:29] i am checking up on analytics disk host alerts, and noticed that.
[14:50:39] some analytics nodes do need checks for sdb1 [14:50:40] yup, still horrible though :) [14:50:42] but, they are mounted elsewhere [14:50:43] yeah [14:50:46] but not affecting me :) [14:50:48] so uhhh [14:50:52] more generally though [14:50:55] i'm having a similar problem [14:51:00] i need to adjust thresholds for disk checks [14:52:04] 6% on the analytics disks is a lot [14:52:06] so is 3% even [14:52:12] Glaisher: I've seen your comment about the patches awaiting deploy. Yes, I can take care of it, at this evening's SWAT or tomorrow. [14:52:16] think I can hack that together via hiera somehow? [14:52:27] hiera lookup for thresholds with a default of 6% and 3$? [14:52:29] 3%? [14:54:20] !log restarted ganglia-monitor on all cp* (many were obviously-broken, probably most recently from bad startup after the reboots last week) [14:54:25] Logged the message, Master [14:55:47] (CR) Alexandros Kosiaris: mathoid to service::node (1 comment) [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh) [14:57:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 28.57% of data above the critical threshold [500.0] [14:57:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 3 below the confidence bounds [14:59:32] (CR) Physikerwelt: [C: 1] "with regard to merging... please have a look at math-preview.wmflabs.org/w/index.php?title=Help:Formula" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh) [15:00:03] operations: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1312654 (Selsharbaty-WMF) NEW [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150526T1500). 
[15:01:22] (CR) Physikerwelt: ";-)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh) [15:01:53] something bad is happening [15:01:56] bblack: that you? [15:02:07] https://gdash.wikimedia.org/dashboards/reqerror/ [15:07:07] ottomata: the 5xx.log is very unordered, is that expected? [15:08:00] paravoid: it is, for now, there's a ticket somewhere [15:08:19] unordered yes, but it shouldn't be way out of order [15:08:24] paravoid: I didn't do anything for reqerror, but perhaps related to ganglia-monitor restart/ [15:08:29] there is clearly an issue [15:08:46] https://phabricator.wikimedia.org/T99716 [15:08:48] ottomata: I think that's still the "some shard is running behind" thing [15:08:55] yes [15:09:10] i haven't looked into this yet. could be a big or small kafkatee problem :/ [15:09:11] not sure yet [15:09:32] paravoid, if you don't care about historical, and just want to see the 5xxes right now, you can restart kafkatee [15:09:41] and it should fix itself and start consuming from the end of the stream [15:09:53] don't ask me why ganglia-monitor restart would cause a 5xx spike, though. I don't think reqerror graph's data pipeline involves ganglia-monitor, right? [15:09:59] it seems that if it runs for a long time it starts to lag on some partitions, [15:10:21] bblack: afaik, no. reqstats as is is via udp2log [15:10:29] udp2log -> sqstat -> graphite [15:10:31] By the way, no one is there for SWAT? [15:10:41] well, udp2log isn't the start of that either [15:11:32] perhaps restarting ganglia-monitor hiccups something through some locking related to the VSL shm stuff? 
[15:14:25] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:15:40] (CR) Alexandros Kosiaris: mathoid to service::node (1 comment) [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh) [15:17:19] Krenair: bblack: paravoid ottomata > it seems no one showed for SWAT, could one of you instead deploy two config change patches? [15:17:24] oh [15:17:26] hi [15:17:34] ok [15:17:52] jynus, you doing anything on tin? [15:17:52] Dereckson: no [15:18:13] not right now, actually was waiting [15:18:15] Dereckson: we're not involved with the SWAT process [15:18:35] (PS13) Alexandros Kosiaris: mathoid to service::node [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh) [15:18:44] have to repool one node later, Krenair [15:18:55] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:19:58] (PS10) Yuvipanda: ores: Initial module, with web class / role [puppet] - https://gerrit.wikimedia.org/r/213354 [15:20:59] Dereckson, hm [15:21:14] Dereckson, https://phabricator.wikimedia.org/T99490 - don't know if we should encourage importing mediawiki files.. [15:21:32] akosiaris: hey. Can you look into, https://gerrit.wikimedia.org/r/#/c/209177 when you've time? :) [15:22:03] ²q:: [15:22:09] I concur. 
[15:22:23] (PS2) Alex Monk: National Heritage Day Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/213257 (https://phabricator.wikimedia.org/T100051) (owner: Dereckson) [15:22:32] (CR) Alex Monk: [C: 2] National Heritage Day Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/213257 (https://phabricator.wikimedia.org/T100051) (owner: Dereckson) [15:22:44] (Merged) jenkins-bot: National Heritage Day Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/213257 (https://phabricator.wikimedia.org/T100051) (owner: Dereckson) [15:22:46] I'll recommend in the bug to restrict the use of this feature to the Template: namespace as much as possible. [15:23:26] !log krenair Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/213257/ (duration: 00m 14s) [15:23:31] Logged the message, Master [15:23:36] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [15:23:53] (PS2) Alex Monk: Import sources on mai.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/213652 (https://phabricator.wikimedia.org/T99490) (owner: Dereckson) [15:23:58] (CR) Alex Monk: [C: 2] Import sources on mai.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/213652 (https://phabricator.wikimedia.org/T99490) (owner: Dereckson) [15:24:02] Krenair, tell me when you are finished on tin [15:24:04] (Merged) jenkins-bot: Import sources on mai.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/213652 (https://phabricator.wikimedia.org/T99490) (owner: Dereckson) [15:24:43] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/213652/ (duration: 00m 15s) [15:24:47] Logged the message, Master [15:24:58] Dereckson, ^ [15:25:14] Can't really test that, but wiki not broken. 
[15:25:24] (we could ask a steward) [15:25:46] jynus, done [15:25:49] Dereckson, I'm sure it's fine [15:25:54] Krenair, thank you! [15:26:41] (PS2) Jcrespo: Repool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/213835 (https://phabricator.wikimedia.org/T99520) [15:27:03] Thanks for the deploy. [15:27:27] (CR) Jcrespo: [C: 2] Repool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/213835 (https://phabricator.wikimedia.org/T99520) (owner: Jcrespo) [15:28:57] kart_: did a very minor review, I added mobrovac as a reviewer as he is way better as a choice on this one than me [15:29:17] (CR) Physikerwelt: [C: 1] "I'm happy." [puppet] - https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: Ori.livneh) [15:29:30] 213652 tested by Shanmugamp7, works. [15:32:00] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1063 (warm period) (duration: 00m 13s) [15:32:04] Logged the message, Master [15:38:08] (PS1) KartikMistry: CX: Log to logstash [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [15:40:47] operations, Analytics-EventLogging, Analytics-Kanban, Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1312720 (BBlack) We probably want to bump up the shm_workspace as well (which already occasionally... [15:42:11] akosiaris: thanks! [15:42:40] mutante: ^ it is now your week right? [15:42:48] (i just read it off old meeting notes) [15:49:53] (CR) Glaisher: [C: 1] "Local user says Wikimedia (talk) is fine." [mediawiki-config] - https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: Dereckson) [15:52:34] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1063 (duration: 00m 14s) [15:52:40] Logged the message, Master [15:55:53] Dereckson: Cool, thanks. [15:56:15] You're welcome. 
[15:56:17] jouncebot_, next [15:56:17] In 0 hour(s) and 3 minute(s): User Group China wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150526T1600) [15:59:13] (PS1) Shanmugamp7: Enable Extension:NewUserMessage on ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/213841 (https://phabricator.wikimedia.org/T100431) [15:59:39] yay, Glaisher ^ it works :) [15:59:48] :) [15:59:51] operations, Patch-For-Review, database: On a maintenance window, upgrade db1063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1312766 (jcrespo) Open>Resolved db1063 has been rebooted and repooled. Now running on the latest LTS and MariaDB-WMF version. [16:00:04] Krenair: Respected human, time to deploy User Group China wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150526T1600). Please do the needful. [16:01:04] oh dear.. [16:01:21] [1bee8b74] [no req] MWException from line 4033 of /srv/mediawiki-staging/php-1.26wmf6/includes/db/Database.php: Could not open "/srv/mediawiki-staging/php-1.26wmf6/extensions/TitleKey/titlekey.sql". 
[16:01:22] hm [16:02:17] looks like that extension was undeployed at some point since the last wiki creation [16:02:22] but remains in the setup script [16:06:14] (PS8) Alex Monk: cn.wikimedia.org initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: Dereckson) [16:06:21] (CR) Alex Monk: [C: 2] cn.wikimedia.org initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: Dereckson) [16:06:43] (Merged) jenkins-bot: cn.wikimedia.org initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: Dereckson) [16:07:08] Krenair, I will stay a bit more to check some things on db side for this task, happy to help also [16:07:16] PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 80926 MB (4% inode=99%): /var/lib/hadoop/data/c 81145 MB (4% inode=99%): /var/lib/hadoop/data/e 78251 MB (4% inode=99%): /var/lib/hadoop/data/f 81745 MB (4% inode=99%): /var/lib/hadoop/data/h 75087 MB (3% inode=99%): /var/lib/hadoop/data/l 82949 MB (4% inode=99%): /var/lib/hadoop/data/b 81522 MB (4% inode=99%): /var/lib/hadoop/data/k [16:07:39] !log krenair Synchronized w/static/images/project-logos/cnwikimedia.png: (no message) (duration: 00m 19s) [16:07:43] Logged the message, Master [16:07:58] !log krenair Synchronized database lists: (no message) (duration: 00m 15s) [16:08:01] Logged the message, Master [16:08:18] !log krenair Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 15s) [16:08:21] Logged the message, Master [16:11:05] bd808, you there? 
[16:11:32] (PS2) Shanmugamp7: Enable Extension:NewUserMessage on ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/213841 (https://phabricator.wikimedia.org/T100431) [16:11:39] or twentyafterfour [16:12:02] ohhh [16:12:08] right, my bad. okay [16:12:16] !log krenair rebuilt wikiversions.cdb and synchronized wikiversions files: add cnwikimedia [16:12:20] Logged the message, Master [16:15:44] ottomata: heh i see that disk space alert fired while we were in the meeting discussing it :) [16:16:07] (PS1) Alex Monk: Fix syntax error in wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/213847 [16:16:31] (CR) Alex Monk: [C: 2] "Oops :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/213847 (owner: Alex Monk) [16:16:37] (Merged) jenkins-bot: Fix syntax error in wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/213847 (owner: Alex Monk) [16:17:00] okay, so why does https://cn.wikimedia.org/ still error... hmm [16:17:22] i shushed that! [16:17:25] ohhh, but probably for just an hour, oops [16:17:25] PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 82807 MB (4% inode=99%): /var/lib/hadoop/data/c 80711 MB (4% inode=99%): /var/lib/hadoop/data/e 79786 MB (4% inode=99%): /var/lib/hadoop/data/f 80551 MB (4% inode=99%): /var/lib/hadoop/data/h 74375 MB (3% inode=99%): /var/lib/hadoop/data/l 83071 MB (4% inode=99%): /var/lib/hadoop/data/b 82997 MB (4% inode=99%): /var/lib/hadoop/data/k [16:17:28] gah [16:17:30] shushing [16:18:19] (PS3) Shanmugamp7: Enable Extension:NewUserMessage on ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/213841 (https://phabricator.wikimedia.org/T100431) [16:18:42] Glaisher: hopefully i removed the whitespace, pls check once you are back:P [16:19:18] Krenair: ? 
[16:19:26] never mind, I figured out what I did [16:19:29] ok [16:19:35] stupid missing " [16:24:46] can't figure out why https://cn.wikimedia.org/ is not working yet [16:27:54] !log krenair Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 13s) [16:27:58] Logged the message, Master [16:32:16] it's in all.dblist [16:36:42] (PS1) Ottomata: Allow base modules' check_disk options to be overridden by hiera [puppet] - https://gerrit.wikimedia.org/r/213848 [16:37:02] paravoid: ^ would you check that out? [16:40:37] krenair@mw1017:/srv/mediawiki$ grep cnwikimedia all.dblist [16:40:37] cnwikimedia [16:40:41] this does not make sense [16:41:05] why does https://cn.wikimedia.org/ not work? [16:41:41] did you resync the lists after the correction? [16:43:32] oh, I see ^ [16:43:39] wgDBname seems to be being set to null? [16:43:56] but it's in wikiversions.cdb [16:45:06] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312812 (JAufrecht) Let me see if I understand the options. I see all of these as temporary hacks, because we hope to get something built in to Phab eventually. Option 1: Grant J... [16:48:13] any ideas twentyafterfour? [16:49:00] hmm [16:51:43] it appears in $wgConf->wikis [16:53:16] I can use mwscript on it [16:53:24] PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 84986 MB (4% inode=99%): /var/lib/hadoop/data/c 74386 MB (3% inode=99%): /var/lib/hadoop/data/e 81059 MB (4% inode=99%): /var/lib/hadoop/data/f 84093 MB (4% inode=99%): /var/lib/hadoop/data/h 76919 MB (4% inode=99%): /var/lib/hadoop/data/l 83777 MB (4% inode=99%): /var/lib/hadoop/data/b 83391 MB (4% inode=99%): /var/lib/hadoop/data/k [16:53:44] PROBLEM - puppet last run on mw2038 is CRITICAL puppet fail [16:57:58] strange. 
[16:58:12] I'm poking the code here to see if I can figure out any possible cause [16:58:42] thanks [16:59:27] i checked on a random appserver and the ServerAlias exists yet if you just curl localhost -H "Host:cn.wikimedia.org" it's also an unknown wiki [17:00:00] the docroot is the same for all chapter wikis [17:01:43] yeah it definitely gets through to php because our missing.php comes up [17:02:20] (PS1) Tim Landscheidt: Tools: Do not require package python-sh [puppet] - https://gerrit.wikimedia.org/r/213849 (https://phabricator.wikimedia.org/T91874) [17:05:15] (PS4) Jcrespo: Update of parsercache db servers to MariaDB 10 (#1) [puppet] - https://gerrit.wikimedia.org/r/213784 [17:06:14] dbList.php doesn't include "mediawiki" in the list of "wiki_projects" but I doubt that is the issue here. [17:06:29] I mean wikimedia [17:07:02] (PS3) Dzahn: deployment server init should configure repo every time [puppet] - https://gerrit.wikimedia.org/r/211435 (owner: ArielGlenn) [17:07:16] twentyafterfour, think I've got ti [17:07:18] it* [17:07:21] yep [17:07:28] there we go, works on mw1017 [17:07:59] added cn here: https://github.com/wikimedia/operations-mediawiki-config/blob/master/multiversion/MWMultiVersion.php#L182-L183 [17:09:43] ah [17:09:54] (PS1) Alex Monk: Fix new cnwikimedia site to actually work now [mediawiki-config] - https://gerrit.wikimedia.org/r/213852 [17:10:15] so just needed for chapter wikis which don't happen that often [17:10:25] right, need to document this [17:10:27] !log krenair Synchronized multiversion/MWMultiVersion.php: open cnwikimedia (duration: 00m 13s) [17:10:31] Logged the message, Master [17:10:48] but in the meantime: https://cn.wikimedia.org/wiki/%E9%A6%96%E9%A1%B5 [17:11:08] https://wikitech.wikimedia.org/wiki/Add_a_wiki [17:11:20] uhm [17:11:40] look in MWMultiVersion.php line 182 [17:11:52] (PS10) BBlack: sslcert: generate chained certs automatically [puppet] - https://gerrit.wikimedia.org/r/197341 
(owner: Faidon Liambotis) [17:11:53] oh [17:11:55] I'm too slow [17:12:10] that's a dirty hacky bunch of code in there [17:12:12] yeah [17:12:25] RECOVERY - puppet last run on mw2038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:13:04] heh I'm still proud that I found that, even if it was too late ;) [17:13:11] !log krenair Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 15s) [17:13:15] Logged the message, Master [17:13:19] heya _joe_, can I get a quick review from you? [17:13:22] hiera related [17:13:58] (CR) Alex Monk: [C: 2] "already in production" [mediawiki-config] - https://gerrit.wikimedia.org/r/213852 (owner: Alex Monk) [17:14:04] (Merged) jenkins-bot: Fix new cnwikimedia site to actually work now [mediawiki-config] - https://gerrit.wikimedia.org/r/213852 (owner: Alex Monk) [17:14:19] (PS2) Ottomata: Allow base modules' check_disk options to be overridden by hiera [puppet] - https://gerrit.wikimedia.org/r/213848 [17:14:55] is it correct to move all "$cluster = 'misc'" out of site.pp and replace with "cluster: misc" in hiera? like https://gerrit.wikimedia.org/r/#/c/210835/1 [17:16:52] (CR) Ryan Lane: [C: 1] deployment server init should configure repo every time [puppet] - https://gerrit.wikimedia.org/r/211435 (owner: ArielGlenn) [17:19:39] mutante: if it works, yeah imo :) [17:21:30] (PS11) BBlack: sslcert: generate chained certs automatically [puppet] - https://gerrit.wikimedia.org/r/197341 (owner: Faidon Liambotis) [17:22:42] um [17:22:42] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312873 (Dzahn) >>! In T98676#1299386, @coren wrote: > Creating a wiki and having it replicated in Labs may be thought of as an indepe... 
[17:22:48] so clearly something is slightly wrong [17:23:00] https://cn.wikimedia.org/wiki/Special:%E7%94%A8%E6%88%B7%E7%99%BB%E5%BD%95/signup - wut? [17:23:27] that definitely exists [17:23:44] jynus, [17:23:50] JohnFLewis: heh, universal comment :) [17:25:12] mutante: universal comment for a universal question [17:25:15] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312877 (jcrespo) >>! In T98676#1312873, @Dzahn wrote: >>>! In T98676#1299386, @coren wrote: >> Creating a wiki and having it replicat... [17:26:07] mutante: am in search of reviewer: [17:26:07] https://gerrit.wikimedia.org/r/#/c/213848/1 [17:26:12] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312879 (Krenair) Resolved>Open [17:26:35] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1274366 (Krenair) Something is wrong. The database exists and yet: "Unknown database 'cnwikimedia' (10.64.16.20)" on the signup page [17:27:00] that's db1031.. [17:27:01] Krenair, so my comment? [17:27:07] *saw [17:27:12] yes [17:27:21] all good on db side [17:27:35] well clearly not... [17:27:38] ? [17:27:43] https://cn.wikimedia.org/wiki/%E9%A6%96%E9%A1%B5 now errors [17:27:50] it wasn't doing this just now [17:27:54] (PS12) BBlack: sslcert: generate chained certs automatically [puppet] - https://gerrit.wikimedia.org/r/197341 (owner: Faidon Liambotis) [17:28:05] wow [17:28:17] the db exists on db1038... [17:28:42] operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1312882 (RobH) Do we want to host mailman archives in a VM? 
While the process itself isn't that demanding, it's just a lot of semi-static (they do require regeneration when posts/content is pulled) files for web viewing. [17:28:56] has it not replicated properly? [17:29:18] strange, it should have provoked an alert [17:29:56] 38 is the master [17:30:16] by definition you can only create it once [17:30:27] it replicated ok to the other hosts [17:30:29] so it worked for me earlier because I happened to hit master? [17:30:45] I was checking it while you did your thing [17:31:27] There was some awkwardness in the DB creation because part of the script relied on an extension that was undeployed [17:31:33] ok [17:31:37] had to comment out some bits that were already done so I could re-run the script [17:31:42] then it may have failed once [17:31:46] that would be normal [17:31:53] but is it failing it more? [17:31:59] it isn't for me [17:32:06] you don't get an error at https://cn.wikimedia.org/wiki/%E9%A6%96%E9%A1%B5 ? [17:32:24] not me [17:32:43] (but it may be a 1 in 10 thing) [17:33:00] hmm [17:33:20] so 1038 is the s3 master [17:33:23] are you logged in [17:33:24] ? [17:33:29] (CR) Dzahn: "it looks mostly good but i'm not sure if the role-based hiera lookup will work or only host-based or regex-based. in another case it did n" [puppet] - https://gerrit.wikimedia.org/r/213848 (owner: Ottomata) [17:33:29] but the errors are from 31 and 29 [17:33:32] not to the wiki, no [17:33:38] yes [17:33:40] 31 and 29 are in x2 [17:33:43] ottomata: ^ well.. that.. i dont know if role-based lookup will work [17:33:43] x1* sorry [17:33:59] ok, then that is a different shard [17:34:11] which is good news [17:35:48] operations, Analytics-Cluster, procurement: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1312883 (Ottomata) NEW [17:35:55] mutante: ? 
[17:37:17] SHOW DATABASES like 'cn%'; Empty set (0.00 sec) on x2 [17:37:28] right, it was x1 [17:37:30] that was my mistake [17:37:48] sorry, on x1 [17:38:20] db1029 [17:38:46] I supposed that the script would create it? [17:39:51] jynus, https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/addWiki.php#L80 [17:39:54] oh, interesting, mutante, hm. [17:40:35] let me check, maybe it is creating it on the wrong place [17:40:45] mutante: but a lot of other rolebased lookups exist, right? [17:41:31] ottomata: yes, it didn't work for ganglia_new and i'm not sure why [17:41:36] hm, mutante, is it because the role isn't setting a local var for the role class? [17:41:42] i'm trying to set a var for an included class? [17:41:48] oh, and it is a define. hmmMMm [17:41:50] hm [17:41:53] oh, no [17:41:55] its a module car [17:41:56] var [17:41:58] sorry [17:42:00] yeah, hm [17:42:17] like, because the yaml file isn't setting a variable for the role it is named after? [17:42:17] worst case it just breaks the checks for them, so i think it would still be ok to just try [17:42:41] yeah, breaks checks for hadoop nodes, right? it wouldn't break anything for other prod nodes, right? [17:42:51] since the same default is in the class parameter [17:42:53] right, the other ones should be getting the default value [17:42:57] ok, cool [17:42:58] that you took as it was [17:42:59] gonna try it, thanks [17:43:08] ok, sure [17:43:34] (CR) Ottomata: [C: 2] "Mutante and I are not sure if this will work for the hadoop node override, but are pretty sure it won't break any other production checks." [puppet] - https://gerrit.wikimedia.org/r/213848 (owner: Ottomata) [17:45:28] Krenair, yes, that should be run against db1029 or x1-master [17:45:35] RECOVERY - Disk space on analytics1015 is OK: DISK OK [17:46:05] works mutante! 
[17:46:11] -command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]" [17:46:11] +command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l -e -A -i "/var/lib/hadoop/data" [17:48:20] can you rerun that part? [17:49:05] RECOVERY - Disk space on analytics1017 is OK: DISK OK [17:49:08] jynus, re-run just the create database on s3 master? what would that achieve? [17:49:24] RECOVERY - Disk space on analytics1014 is OK: DISK OK [17:49:35] it is not just the create- all tables are missing [17:49:47] ottomata: :) [17:50:13] jynus, on s3? [17:50:19] no no [17:50:29] s3 is ok [17:50:31] the script only runs against s3.. [17:50:38] mmm [17:51:05] yay! [17:51:43] problem is that it will fail, because one schema has been created [17:52:17] I can manually recreate the structure for x1 [17:52:26] but here be dragons [17:52:51] let me try [17:56:52] cmjohnson1: any word on 1028 board replacement? [17:57:22] just realized you're not in the other channel [17:57:24] ottomata: I had to call Dell due to the severity of the problem (an1028). New parts should be here tomorrow or Thursday. [17:57:52] ok thank you [17:57:58] oh ja signing in there [17:58:18] cmjohnson1: also, has paravoid talked to you about 1036? [17:58:26] Krenair, that *may* work [17:58:34] no errors now [17:58:36] what'd you do? [17:58:39] not yet [17:58:51] Krenair, private [17:58:53] https://phabricator.wikimedia.org/T99845 [17:59:44] i'm not sure what steps need to be taken cmjohnson1, i guess maybe changing switch ports? [17:59:46] paravoid: knows more [18:00:06] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150526T1800). Please do the needful. 
[18:00:33] okay, I will set up another port to try and whenever you're ready I will move the eth cable [18:01:14] cmjohnson1: that node is effectively down, so you may move at any time [18:01:18] maybe sync up with paravoid first [18:01:24] PROBLEM - puppet last run on neon is CRITICAL puppet fail [18:01:31] okay! [18:04:48] eh..Error: Cannot open config file '/etc/icinga/contacts.cfg' for reading: Permission denied [18:04:56] looks [18:05:16] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [18:05:47] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312919 (jcrespo) I fixed this by manually creating the database on x1-master, but I am not 100% sure this will work. It seems that th... [18:06:46] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:06:49] eh, ignore that, that was just needing sudo to run the check [18:08:58] mutante: i'm trying to check that too, seeing anything? [18:09:01] oh its ok now [18:09:04] my network is being weird [18:09:07] gonna run home soon [18:10:01] ottomata: yea, it was just a glitch, looks fine [18:10:20] oh, because i was looking at that file i think [18:11:46] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [18:15:32] (PS1) 20after4: Group1 wikis to 1.26wmf7 [mediawiki-config] - https://gerrit.wikimedia.org/r/213857 [18:16:42] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1312927 (Dzahn) Personally i would vote for 1a) or a modified 1a). - get Joel a shell user to get on some internal work host with a mysql client (tin, mira?) - make a new, persona... 
[18:21:06] (CR) 20after4: [C: 2] Group1 wikis to 1.26wmf7 [mediawiki-config] - https://gerrit.wikimedia.org/r/213857 (owner: 20after4) [18:21:12] (Merged) jenkins-bot: Group1 wikis to 1.26wmf7 [mediawiki-config] - https://gerrit.wikimedia.org/r/213857 (owner: 20after4) [18:21:49] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group1 wikis to 1.26wmf7 [18:21:56] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:21:57] Logged the message, Master [18:31:07] PROBLEM - puppet last run on mw2105 is CRITICAL Puppet has 1 failures [18:31:28] robh: re. your comment on the sodium upgrade ticket: is that comment meant to just say 'do we really want this possibly intensive process running on a VM?' or otherwise? [18:32:01] questioning the intensive rebuild process and use of a vm for http(s) list archives [18:32:29] it's not a lot of storage but still [18:32:35] (not outright against it either) [18:32:51] so if alex and mark think it'll be fine there, im cool with it. i just wanted to call the issues out specifically [18:33:50] the intensive process is a spike and not regular, so i imagine it's fine [18:33:55] the rebuild process that is. [18:33:58] robh: are you going to nanog? [18:34:24] wasn't planning to nope [18:35:21] I realize that I should want to go to those events to professionally network, but large conference crowds creep me out. [18:35:23] I usually go when they are in town. [18:35:29] nanog is pretty small. [18:35:39] no, 4 folks is pretty small, 6 folks is medium [18:35:43] 7+ is large [18:35:43] robh: eh true. the comment earlier from Filippo (I believe) was if it doesn't need a public IP, a VM is best otherwise hardware so it depends on where that comment stands really for Mark and Alex [18:35:50] introvert.. 
:) [18:36:02] JohnLewis: well, i think we could put it behind misc-web for the serving of list traffic [18:36:09] for http(s) [18:36:22] but im not certain of feasibility of internal ip for the actual list mail routing and such [18:36:36] robh: yeah but the exim part is my only issue. [18:36:41] robh: Leslie will be there... [18:36:54] it'll be nice just to catch up with her. [18:36:59] cajoel: i hang out with leslie all the time, i would intentionally avoid her at the conference so not to take up her time ;D [18:37:12] see, if i hang with her there, im stealing leslie time from others [18:38:14] so paying money to go be really uncomfortable with a large group? I do wikimania every 2 years thanks ;D [18:38:53] though, i say that, and went to puppetconf, so im obviously a liar. [18:39:08] it just takes a lot (like for puppetconf we had a lot of the ops team there) [18:41:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [18:44:04] (PS1) 20after4: Add a --all option to updateBranchPointers to update all branches [mediawiki-config] - https://gerrit.wikimedia.org/r/213859 [18:46:16] RECOVERY - puppet last run on mw2105 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:46:30] (PS2) 20after4: Add a --all option to updateBranchPointers to update all branches [mediawiki-config] - https://gerrit.wikimedia.org/r/213859 [18:49:10] operations, ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1312963 (Cmjohnson) I am not sure why and how the same interfaces are in there 2 different ways. I think we should clean this up on the switch first and then see if this will correct the problem. I... 
[18:49:56] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:54:38] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1312984 (Krenair) Open>Resolved Seems to work, I created [[ https://cn.wikimedia.org/wiki/User:Krenair | my user page ]]. I've a... [19:05:16] (CR) Ottomata: "CooOOl. If we can get this merged and working, I am happy to make my varnishstats work use this." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/213293 (owner: Ori.livneh) [19:07:42] operations, ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1313016 (Cmjohnson) Chatted with mark in irc and he gave the +1 on committing that change. While I was at it, I added a secondary port to check if we need (ge-2/0/37) it but did not add to the vlan just... [19:17:26] twentyafterfour, officially https://gerrit.wikimedia.org/r/#/c/213846/ should be backported 1.26wmf7 [19:17:48] not much point unless someone wants to create a wiki within the next fortnight though [19:17:56] PROBLEM - puppet last run on lvs3003 is CRITICAL puppet fail [19:18:11] uh, week. group1 -> wmf7 today, right [19:22:52] aude, you there? [19:23:56] aude, I've some really weird thing to do with wikisource and wikidata and the site table population script [19:24:02] seen something* [19:34:45] RECOVERY - puppet last run on lvs3003 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:48:12] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1313088 (chasemp) From my standpoint the only viable solution at the moment is 2a, and I have at least a semblance of reasoning for my pessimism and general poo-poo'ing which I wil... [19:56:17] YuviPanda: yt?
[19:58:31] in phabricator, i would like to make it so that "updated subscribers" is not a notification, but "added comment" still is [20:01:05] you can exclude subscribers notifications [20:01:08] iirc [20:01:12] mutante, https://phabricator.wikimedia.org/settings/panel/emailpreferences/ [20:01:23] this^ [20:01:23] you want maniphest tasks - " A task's subscribers change." = "Ignore" [20:01:37] thanks guys, that's what i was looking for [20:02:13] (PS1) Ottomata: Add icinga check for Hadoop YARN NodeManager Node-State [puppet] - https://gerrit.wikimedia.org/r/213874 [20:03:00] and don't ask what "Other task activity not listed above occurs." means [20:03:03] (CR) Ottomata: "Yuvi, would appreciate a review from you here, as I haven't used the nagios_common module before, and I don't see that many uses of it in " [puppet] - https://gerrit.wikimedia.org/r/213874 (owner: Ottomata) [20:03:08] I got back "whatever we haven't thought of yet" [20:03:36] * mutante changes most things from email to notifications [20:03:56] chasemp: heh, ok:) [20:18:24] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1313148 (JAufrecht) To clarify, I'm trying to produce a proof of concept of this kind of reporting, which means I'm trying to create a historical report with real data for at leas... [20:35:25] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1313172 (chasemp) Could we be so simple as to select every `viewPolicy="public"` from maniphest_task and then associated transactions and edge relationships for the matching tasks?...
[20:36:55] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [20:52:00] Blocked-on-Operations, operations, Maps, Scrum-of-Scrums, hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1313195 (Yurik) Excellent news! Over the weekend, I was shown a database in labs (I can't lookup the host... [20:53:46] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:00:04] rmoen, kaldari: Respected human, time to deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150526T2100). Please do the needful. [21:00:23] Lol [21:02:19] (PS1) Dzahn: build 5.3.4 for jessie, remove old patches [debs/ruby-jsduck] - https://gerrit.wikimedia.org/r/213954 (https://phabricator.wikimedia.org/T95008) [21:04:46] PROBLEM - puppet last run on carbon is CRITICAL Puppet last ran 7 hours ago [21:05:08] Bsadowski1? [21:05:26] Oh, it was the "Respected human" part.
[21:07:55] oh, yeah [21:07:56] that's normal [21:08:05] I think it has different names for some people [21:08:07] or was that logmsgbot [21:08:31] (CR) Dzahn: "re: the patches that i'm removing:" [debs/ruby-jsduck] - https://gerrit.wikimedia.org/r/213954 (https://phabricator.wikimedia.org/T95008) (owner: Dzahn) [21:09:40] (PS2) Dzahn: build 5.3.4 for jessie, remove old patches [debs/ruby-jsduck] - https://gerrit.wikimedia.org/r/213954 (https://phabricator.wikimedia.org/T95008) [21:11:08] (PS3) Dzahn: build 5.3.4 for jessie, remove old patches [debs/ruby-jsduck] - https://gerrit.wikimedia.org/r/213954 (https://phabricator.wikimedia.org/T95008) [21:14:35] Blocked-on-Operations, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1313235 (Dzahn) http://people.wikimedia.org/~dzahn/ruby-jsduck/ [21:16:46] (CR) Dzahn: [C: 1] "built with these changes: http://people.wikimedia.org/~dzahn/ruby-jsduck/" [debs/ruby-jsduck] - https://gerrit.wikimedia.org/r/213954 (https://phabricator.wikimedia.org/T95008) (owner: Dzahn) [21:17:29] Blocked-on-Operations, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1313236 (Dzahn) a:Dzahn [21:18:33] operations, Analytics-Cluster, procurement: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1313237 (Dzahn) p:Triage>Normal [21:19:21] operations, ops-eqiad: analytics1036 can't talk cross row?
- https://phabricator.wikimedia.org/T99845#1313239 (Dzahn) p:Triage>Normal [21:19:35] operations: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1313241 (Dzahn) a:Dzahn [21:19:46] Ops-Access-Requests, operations: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1299294 (Dzahn) [21:19:54] Ops-Access-Requests, operations: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1299294 (Dzahn) p:Triage>Normal [21:21:13] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1313251 (Dzahn) p:Triage>Normal [21:21:31] operations, Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1313252 (Dzahn) p:Triage>Normal [21:22:40] operations, Icinga, Monitoring: remove (or fix) passive checks for removed hosts - https://phabricator.wikimedia.org/T99012#1313256 (Dzahn) Open>declined a:Dzahn @Jgreen ok, thanks. let's close it then.
good enough [21:23:17] operations, Icinga, Monitoring: Remove monitoring alerts for "0 unmerged changes in mediawiki_config" - https://phabricator.wikimedia.org/T99001#1313259 (Dzahn) p:Triage>Low [21:23:58] operations, ops-esams: Check power supply balance settings on cp3030+ - https://phabricator.wikimedia.org/T98984#1313260 (Dzahn) p:Triage>Normal [21:24:21] operations, Icinga, Monitoring: Icinga RAID monitoring status "NRPE: Unable to read output " reported as OK - https://phabricator.wikimedia.org/T98978#1313262 (Dzahn) p:Triage>Normal [21:24:56] RECOVERY - puppet last run on carbon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:25:36] PROBLEM - puppet last run on ms-be1017 is CRITICAL Puppet has 1 failures [21:48:18] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1313269 (JAufrecht) I also need phabricator_project.project and phabricator_project.column. Other than that, this sounds fine. [21:48:46] PROBLEM - puppet last run on wtp2018 is CRITICAL puppet fail [21:51:31] operations, Analytics-Engineering: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#1313270 (Dzahn) p:Triage>Normal [21:51:58] operations, Deployment-Systems: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#1313272 (Dzahn) p:Triage>Normal [21:53:29] operations, Wikimedia-Logstash, Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1313275 (Dzahn) p:Triage>Normal 1.3.6 on jessie?
i already see 1.3.9 on terbium and that is precise [21:54:05] operations, HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1313277 (Dzahn) p:Triage>High [21:58:29] operations, Wikimedia-Logstash, Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1313285 (MoritzMuehlenhoff) The version in standard jessie is 1.0.2, but there's 1.4.5 in experimental which could easily be backported, the cu... [21:59:57] Blocked-on-Operations, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1313287 (Dzahn) eh... and now i found T83282 . same thing? how did that get resolved when changelog of the package was st... [22:02:55] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:03:23] operations, ops-requests: Update ruby-jsduck package to v5.3.4 - https://phabricator.wikimedia.org/T83282#1313300 (Dzahn) [22:04:06] RECOVERY - puppet last run on wtp2018 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:09:32] Blocked-on-Operations, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1313316 (Dzahn) so http://apt.wikimedia.org/wikimedia/pool/main/r/ruby-jsduck/ is already here (from T83282) but not che...
[22:13:50] (PS1) Alexandros Kosiaris: Setup ganeti100X with RAID1 [puppet] - https://gerrit.wikimedia.org/r/213958 [22:18:57] Blocked-on-Operations, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1177814 (Dzahn) @akosiaris does that mean i am supposed to build "5.3.4-1wmfjessie1"? also, how about https://gerrit.wi... [22:50:38] operations, Phabricator, Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1313370 (Dzahn) >>! In T85141#1281721, @Nemo_bis wrote: > The dump will be almost totally useless without full user IDs (i.e. email address) for votes, subscribers and reports, as t... [22:56:24] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1313371 (Krenair) (The account was created and a steward assigned them sysop so other users can be created without sysadmin interventi... [22:58:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [22:59:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 4 below the confidence bounds [23:00:04] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150526T2300). Please do the needful. [23:10:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:28:43] hoo, other thing is I noticed that no newprojects email went out [23:29:22] oh **** [23:29:53] :P [23:29:58] well now there is one. [23:30:03] dammit: https://lists.wikimedia.org/pipermail/newprojects/2015-May/000098.html [23:31:40] huh, turns out that script is really simple.
it just generates code to dump straight into eval.php to send the email [23:45:18] I do wonder why this didn't work earlier though [23:46:42] Krenair: I loved the first email 'New wiki: --help' [23:54:37] :/ [23:55:07] I sent something useful manually instead